INTEL Documentation

: matrix-matrix product, triangular matrix, double-precision complex.

Sparse BLAS level 1 naming conventions are similar to those of BLAS level 1. For more information, see Naming Conventions.

Fortran 95 Interface Conventions

The Fortran 95 interface to BLAS and Sparse BLAS Level 1 routines is implemented through wrappers that call the respective FORTRAN 77 routines. This interface uses Fortran 95 features such as assumed-shape arrays and optional arguments to provide simplified calls to BLAS and Sparse BLAS Level 1 routines with fewer parameters.

Intel® Math Kernel Library Reference Manual

NOTE For BLAS, Intel MKL offers two types of Fortran 95 interfaces:
• using mkl_blas.fi only through the include 'mkl_blas_subroutine.fi' statement. Such interfaces allow you to make use of the original BLAS routines with all their arguments
• using blas.f90 that includes improved interfaces. This file is used to generate the module files blas95.mod and f95_precision.mod. The module files mkl95_blas.mod and mkl95_precision.mod are also generated. See also section "Fortran 95 interfaces and wrappers to LAPACK and BLAS" of Intel® MKL User's Guide for details. The module files are used to process the FORTRAN use clauses referencing the BLAS interface: use blas95 (or an equivalent use mkl95_blas) and use f95_precision (or an equivalent use mkl95_precision).

The main conventions used in the Fortran 95 interface are as follows:
• The names of parameters used in the Fortran 95 interface are typically the same as those used for the respective generic (FORTRAN 77) interface. In rare cases formal argument names may be different.
• Some input parameters, such as array dimensions, are not required in Fortran 95 and are skipped from the calling sequence. Array dimensions are reconstructed from the user data, which must exactly follow the required array shape.
• A parameter can be skipped if its value is completely defined by the presence or absence of another parameter in the calling sequence, and the restored value is the only meaningful value for the skipped parameter.
• Parameters specifying the increment values incx and incy are skipped. In most cases their values are equal to 1. In Fortran 95, an increment with a different value can be established directly in the corresponding parameter.
• Some generic parameters are declared as optional in the Fortran 95 interface and may or may not be present in the calling sequence. A parameter can be declared optional if it satisfies one of the following conditions:
  1. It can take only a few possible values. The default value of such a parameter typically is the first value in the list; all exceptions to this rule are explicitly stated in the routine description.
  2. It has a natural default value.

Optional parameters are given in square brackets in the Fortran 95 call syntax.

The particular rules used for reconstructing the values of omitted optional parameters are specific to each routine and are detailed in the respective "Fortran 95 Notes" subsection at the end of the routine specification section. If this subsection is omitted, the Fortran 95 interface for the given routine does not differ from the corresponding FORTRAN 77 interface.

Note that this interface is not implemented in the current version of Sparse BLAS Level 2 and Level 3 routines.

Matrix Storage Schemes

Matrix arguments of BLAS routines can use the following storage schemes:
• Full storage: a matrix A is stored in a two-dimensional array a, with the matrix element aij stored in the array element a(i,j).
• Packed storage: symmetric, Hermitian, or triangular matrices are stored more compactly, with the upper or lower triangle of the matrix packed by columns in a one-dimensional array.
• Band storage: a band matrix is stored compactly in a two-dimensional array: columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array.

For more information on matrix storage schemes, see Matrix Arguments in Appendix B.

BLAS Level 1 Routines and Functions

BLAS Level 1 includes routines and functions, which perform vector-vector operations. Table “BLAS Level 1 Routine Groups and Their Data Types” lists the BLAS Level 1 routine and function groups and the data types associated with them.

BLAS and Sparse BLAS Routines

BLAS Level 1 Routine and Function Groups and Their Data Types

Routine or Function Group   Data Types          Description
?asum                       s, d, sc, dz        Sum of vector magnitudes (functions)
?axpy                       s, d, c, z          Scalar-vector product (routines)
?copy                       s, d, c, z          Copy vector (routines)
?dot                        s, d                Dot product (functions)
?sdot                       sd, d               Dot product with extended precision (functions)
?dotc                       c, z                Dot product conjugated (functions)
?dotu                       c, z                Dot product unconjugated (functions)
?nrm2                       s, d, sc, dz        Vector 2-norm (Euclidean norm) (functions)
?rot                        s, d, cs, zd        Plane rotation of points (routines)
?rotg                       s, d, c, z          Generate Givens rotation of points (routines)
?rotm                       s, d                Modified Givens plane rotation of points (routines)
?rotmg                      s, d                Generate modified Givens plane rotation of points (routines)
?scal                       s, d, c, z, cs, zd  Vector-scalar product (routines)
?swap                       s, d, c, z          Vector-vector swap (routines)
i?amax                      s, d, c, z          Index of the maximum absolute value element of a vector (functions)
i?amin                      s, d, c, z          Index of the minimum absolute value element of a vector (functions)
?cabs1                      s, d                Auxiliary functions, compute the absolute value of a complex number of single or double precision

?asum
Computes the sum of magnitudes of the vector elements.
Syntax
Fortran 77:
res = sasum(n, x, incx)
res = scasum(n, x, incx)
res = dasum(n, x, incx)
res = dzasum(n, x, incx)
Fortran 95:
res = asum(x)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?asum routine computes the sum of the magnitudes of elements of a real vector, or the sum of magnitudes of the real and imaginary parts of elements of a complex vector:
res = |Re x(1)| + |Im x(1)| + |Re x(2)| + |Im x(2)| + ... + |Re x(n)| + |Im x(n)|,
where x is a vector with a number of elements that equals n.

Input Parameters
n       INTEGER. Specifies the number of elements in vector x.
x       REAL for sasum
        DOUBLE PRECISION for dasum
        COMPLEX for scasum
        DOUBLE COMPLEX for dzasum
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for indexing vector x.

Output Parameters
res     REAL for sasum
        DOUBLE PRECISION for dasum
        REAL for scasum
        DOUBLE PRECISION for dzasum
        Contains the sum of magnitudes of real and imaginary parts of all elements of the vector.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine asum interface are the following:
x       Holds the array of size n.

?axpy
Computes a vector-scalar product and adds the result to a vector.

Syntax
Fortran 77:
call saxpy(n, a, x, incx, y, incy)
call daxpy(n, a, x, incx, y, incy)
call caxpy(n, a, x, incx, y, incy)
call zaxpy(n, a, x, incx, y, incy)
Fortran 95:
call axpy(x, y [,a])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?axpy routines perform a vector-vector operation defined as
y := a*x + y
where:
a is a scalar,
x and y are vectors each with a number of elements that equals n.
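The ?asum sum and the ?axpy update described above can be sketched in plain Python. This is only a reference sketch of the semantics, not a call into Intel MKL. The helper name blas_positions is hypothetical, arrays are 0-based Python lists, asum assumes a positive incx, and the handling of negative increments follows the reference BLAS convention (the vector is traversed backwards), which is an assumption here because the text above does not spell it out.

```python
def blas_positions(n, inc):
    """0-based array positions of the n logical elements of a strided vector.

    Assumption (reference BLAS convention): a negative increment walks the
    array backwards, starting at offset (n-1)*abs(inc).
    """
    if inc >= 0:
        return [i * inc for i in range(n)]
    return [(n - 1 - i) * (-inc) for i in range(n)]


def asum(n, x, incx=1):
    """Sum of |Re x(i)| + |Im x(i)| over n elements; assumes incx > 0."""
    total = 0.0
    for i in range(n):
        value = complex(x[i * incx])  # real inputs get a zero imaginary part
        total += abs(value.real) + abs(value.imag)
    return total


def axpy(n, a, x, incx, y, incy):
    """In-place update y := a*x + y over n logical elements."""
    for ix, iy in zip(blas_positions(n, incx), blas_positions(n, incy)):
        y[iy] += a * x[ix]
    return y
```

For example, axpy(3, 2.0, [1, 0, 2, 0, 3], 2, y, 1) reads every second element of x, which matches the minimum array size (1 + (n-1)*abs(incx)) = 5 given in the parameter descriptions.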
Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
a       REAL for saxpy
        DOUBLE PRECISION for daxpy
        COMPLEX for caxpy
        DOUBLE COMPLEX for zaxpy
        Specifies the scalar a.
x       REAL for saxpy
        DOUBLE PRECISION for daxpy
        COMPLEX for caxpy
        DOUBLE COMPLEX for zaxpy
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for saxpy
        DOUBLE PRECISION for daxpy
        COMPLEX for caxpy
        DOUBLE COMPLEX for zaxpy
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
y       Contains the updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine axpy interface are the following:
x       Holds the array of size n.
y       Holds the array of size n.
a       The default value is 1.

?copy
Copies vector to another vector.

Syntax
Fortran 77:
call scopy(n, x, incx, y, incy)
call dcopy(n, x, incx, y, incy)
call ccopy(n, x, incx, y, incy)
call zcopy(n, x, incx, y, incy)
Fortran 95:
call copy(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?copy routines perform a vector-vector operation defined as
y = x,
where x and y are vectors.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for scopy
        DOUBLE PRECISION for dcopy
        COMPLEX for ccopy
        DOUBLE COMPLEX for zcopy
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for scopy
        DOUBLE PRECISION for dcopy
        COMPLEX for ccopy
        DOUBLE COMPLEX for zcopy
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.
Output Parameters
y       Contains a copy of the vector x if n is positive. Otherwise, parameters are unaltered.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine copy interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

?dot
Computes a vector-vector dot product.

Syntax
Fortran 77:
res = sdot(n, x, incx, y, incy)
res = ddot(n, x, incx, y, incy)
Fortran 95:
res = dot(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dot routines perform a vector-vector reduction operation defined as
res = x(1)*y(1) + x(2)*y(2) + ... + x(n)*y(n),
where x(i) and y(i) are elements of vectors x and y.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for sdot
        DOUBLE PRECISION for ddot
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for sdot
        DOUBLE PRECISION for ddot
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
res     REAL for sdot
        DOUBLE PRECISION for ddot
        Contains the result of the dot product of x and y, if n is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine dot interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.
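As a rough illustration of the ?dot reduction and of the Fortran 95 calling shape res = dot(x, y), the following pure-Python sketch takes whole vectors and derives n from their length. It does not call Intel MKL. Using a Python slice such as x[::2] to stand in for a non-unit increment is an analogy chosen for this sketch (in Fortran 95 the caller would pass an array section), not something the manual prescribes.

```python
def dot(x, y):
    """Sketch of res = x(1)*y(1) + ... + x(n)*y(n) for equal-length vectors.

    n is taken from the vectors themselves, as in the Fortran 95 interface.
    An empty vector (n = 0) yields 0.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total


# A strided operand: every second element of a longer array, the role
# played by incx = 2 in the FORTRAN 77 call.
strided_x = [1.0, 9.0, 2.0, 9.0, 3.0][::2]
result = dot(strided_x, [4.0, 5.0, 6.0])
```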
?sdot
Computes a vector-vector dot product with extended precision.

Syntax
Fortran 77:
res = sdsdot(n, sb, sx, incx, sy, incy)
res = dsdot(n, sx, incx, sy, incy)
Fortran 95:
res = sdot(sx, sy)
res = sdot(sx, sy, sb)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?sdot routines compute the inner product of two vectors with extended precision. Both routines use extended precision accumulation of the intermediate results, but the sdsdot routine outputs the final result in single precision, whereas the dsdot routine outputs the double precision result. The function sdsdot also adds scalar value sb to the inner product.

Input Parameters
n       INTEGER. Specifies the number of elements in the input vectors sx and sy.
sb      REAL. Single precision scalar to be added to inner product (for the function sdsdot only).
sx, sy  REAL. Arrays, DIMENSION at least (1 + (n-1)*abs(incx)) and (1 + (n-1)*abs(incy)), respectively. Contain the input single precision vectors.
incx    INTEGER. Specifies the increment for the elements of sx.
incy    INTEGER. Specifies the increment for the elements of sy.

Output Parameters
res     REAL for sdsdot
        DOUBLE PRECISION for dsdot
        Contains the result of the dot product of sx and sy (with sb added for sdsdot), if n is positive. Otherwise, res contains sb for sdsdot and 0 for dsdot.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sdot interface are the following:
sx      Holds the vector with the number of elements n.
sy      Holds the vector with the number of elements n.
NOTE The scalar parameter sb is declared as a required parameter in the Fortran 95 interface for the function sdot to distinguish between function flavors that output the final result in different precision.

?dotc
Computes a dot product of a conjugated vector with another vector.

Syntax
Fortran 77:
res = cdotc(n, x, incx, y, incy)
res = zdotc(n, x, incx, y, incy)
Fortran 95:
res = dotc(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dotc routines perform a vector-vector operation defined as:
res = conjg(x(1))*y(1) + conjg(x(2))*y(2) + ... + conjg(x(n))*y(n),
where x(i) and y(i) are elements of vectors x and y.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       COMPLEX for cdotc
        DOUBLE COMPLEX for zdotc
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       COMPLEX for cdotc
        DOUBLE COMPLEX for zdotc
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
res     COMPLEX for cdotc
        DOUBLE COMPLEX for zdotc
        Contains the result of the dot product of the conjugated x and unconjugated y, if n is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine dotc interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

?dotu
Computes a vector-vector dot product.
Syntax
Fortran 77:
res = cdotu(n, x, incx, y, incy)
res = zdotu(n, x, incx, y, incy)
Fortran 95:
res = dotu(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dotu routines perform a vector-vector reduction operation defined as
res = x(1)*y(1) + x(2)*y(2) + ... + x(n)*y(n),
where x(i) and y(i) are elements of complex vectors x and y.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       COMPLEX for cdotu
        DOUBLE COMPLEX for zdotu
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       COMPLEX for cdotu
        DOUBLE COMPLEX for zdotu
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
res     COMPLEX for cdotu
        DOUBLE COMPLEX for zdotu
        Contains the result of the dot product of x and y, if n is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine dotu interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

?nrm2
Computes the Euclidean norm of a vector.

Syntax
Fortran 77:
res = snrm2(n, x, incx)
res = dnrm2(n, x, incx)
res = scnrm2(n, x, incx)
res = dznrm2(n, x, incx)
Fortran 95:
res = nrm2(x)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?nrm2 routines perform a vector reduction operation defined as
res = ||x||,
where:
x is a vector,
res is a value containing the Euclidean norm of the elements of x.

Input Parameters
n       INTEGER. Specifies the number of elements in vector x.
x       REAL for snrm2
        DOUBLE PRECISION for dnrm2
        COMPLEX for scnrm2
        DOUBLE COMPLEX for dznrm2
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.

Output Parameters
res     REAL for snrm2
        DOUBLE PRECISION for dnrm2
        REAL for scnrm2
        DOUBLE PRECISION for dznrm2
        Contains the Euclidean norm of the vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine nrm2 interface are the following:
x       Holds the vector with the number of elements n.

?rot
Performs rotation of points in the plane.

Syntax
Fortran 77:
call srot(n, x, incx, y, incy, c, s)
call drot(n, x, incx, y, incy, c, s)
call csrot(n, x, incx, y, incy, c, s)
call zdrot(n, x, incx, y, incy, c, s)
Fortran 95:
call rot(x, y, c, s)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given two vectors x and y (real for srot and drot, complex for csrot and zdrot), each element of these vectors is replaced as follows:
x(i) = c*x(i) + s*y(i)
y(i) = c*y(i) - s*x(i)
Both right-hand sides use the values of x(i) and y(i) on entry.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for srot
        DOUBLE PRECISION for drot
        COMPLEX for csrot
        DOUBLE COMPLEX for zdrot
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for srot
        DOUBLE PRECISION for drot
        COMPLEX for csrot
        DOUBLE COMPLEX for zdrot
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.
c       REAL for srot
        DOUBLE PRECISION for drot
        REAL for csrot
        DOUBLE PRECISION for zdrot
        A scalar.
s       REAL for srot
        DOUBLE PRECISION for drot
        REAL for csrot
        DOUBLE PRECISION for zdrot
        A scalar.
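The element-wise replacement performed by ?rot can be sketched as follows. This is a pure-Python illustration that mirrors the shape of the Fortran 95 call rot(x, y, c, s) with unit increments; it is not the MKL routine. The point it shows is that x(i) and y(i) must be read before either is overwritten, so the update of y uses the original x(i).

```python
def rot(x, y, c, s):
    """Apply the plane rotation in place to equal-length lists x and y.

    x(i) := c*x(i) + s*y(i)
    y(i) := c*y(i) - s*x(i)   (using the x(i) value from before the update)
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    for i in range(len(x)):
        xi, yi = x[i], y[i]  # save both old values before overwriting
        x[i] = c * xi + s * yi
        y[i] = c * yi - s * xi


# The point (3, 4) rotated with c = 0.6, s = 0.8 lands on the x-axis at
# distance 5 from the origin, up to rounding.
px, py = [3.0], [4.0]
rot(px, py, 0.6, 0.8)
```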
Output Parameters
x       Each element is replaced by c*x + s*y.
y       Each element is replaced by c*y - s*x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine rot interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

?rotg
Computes the parameters for a Givens rotation.

Syntax
Fortran 77:
call srotg(a, b, c, s)
call drotg(a, b, c, s)
call crotg(a, b, c, s)
call zrotg(a, b, c, s)
Fortran 95:
call rotg(a, b, c, s)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given the Cartesian coordinates (a, b) of a point, these routines return the parameters c, s, r, and z associated with the Givens rotation. The parameters c and s define a unitary matrix such that:
[  c  s ]   [ a ]   [ r ]
[ -s  c ] * [ b ] = [ 0 ]
(for the complex flavors crotg and zrotg, the lower-left entry is -conjg(s)).
The parameter z is defined such that if |a| > |b|, z is s; otherwise, if c is not 0, z is 1/c; otherwise z is 1.
See a more accurate LAPACK version ?lartg.

Input Parameters
a       REAL for srotg
        DOUBLE PRECISION for drotg
        COMPLEX for crotg
        DOUBLE COMPLEX for zrotg
        Provides the x-coordinate of the point p.
b       REAL for srotg
        DOUBLE PRECISION for drotg
        COMPLEX for crotg
        DOUBLE COMPLEX for zrotg
        Provides the y-coordinate of the point p.

Output Parameters
a       Contains the parameter r associated with the Givens rotation.
b       Contains the parameter z associated with the Givens rotation.
c       REAL for srotg
        DOUBLE PRECISION for drotg
        REAL for crotg
        DOUBLE PRECISION for zrotg
        Contains the parameter c associated with the Givens rotation.
s       REAL for srotg
        DOUBLE PRECISION for drotg
        COMPLEX for crotg
        DOUBLE COMPLEX for zrotg
        Contains the parameter s associated with the Givens rotation.

?rotm
Performs modified Givens rotation of points in the plane.
Syntax
Fortran 77:
call srotm(n, x, incx, y, incy, param)
call drotm(n, x, incx, y, incy, param)
Fortran 95:
call rotm(x, y, param)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given two vectors x and y, each vector element of these vectors is replaced as follows:
[ x(i) ]       [ x(i) ]
[ y(i) ] = H * [ y(i) ]
for i = 1 to n, where H is a modified Givens transformation matrix whose values are stored in elements param(2) through param(5) of the param array. See discussion on the param argument.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for srotm
        DOUBLE PRECISION for drotm
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for srotm
        DOUBLE PRECISION for drotm
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.
param   REAL for srotm
        DOUBLE PRECISION for drotm
        Array, DIMENSION 5.
        The elements of the param array are:
        param(1) contains a switch, flag.
        param(2-5) contain h11, h21, h12, and h22, respectively, the components of the array H.
        Depending on the values of flag, the components of H are set as follows (rows of H are separated by semicolons):
        flag = -1.:  H = [ h11  h12 ;  h21  h22 ]
        flag =  0.:  H = [ 1.   h12 ;  h21  1.  ]
        flag =  1.:  H = [ h11  1.  ;  -1.  h22 ]
        flag = -2.:  H = [ 1.   0.  ;  0.   1.  ]
        In the last three cases, the matrix entries of 1., -1., and 0. are assumed based on the value of flag and are not required to be set in the param vector.

Output Parameters
x       Each element x(i) is replaced by h11*x(i) + h12*y(i).
y       Each element y(i) is replaced by h21*x(i) + h22*y(i).

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine rotm interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.
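How the param array drives ?rotm can be sketched in pure Python. The mapping from flag to the entries of H used here is the standard reference BLAS convention (flag values -1, 0, 1 and -2), stated as an assumption of this sketch. The function names h_from_param and rotm are illustrative, the lists are 0-based so param[0] corresponds to param(1), and unit increments are assumed.

```python
def h_from_param(param):
    """Return (h11, h12, h21, h22) implied by the 5-element param array.

    param = [flag, h11, h21, h12, h22]; entries that the flag implies
    (1, -1 or 0) are taken from the flag and not read from param.
    """
    flag, h11, h21, h12, h22 = param
    if flag == -1.0:
        return h11, h12, h21, h22       # full matrix stored in param
    if flag == 0.0:
        return 1.0, h12, h21, 1.0       # unit diagonal implied
    if flag == 1.0:
        return h11, 1.0, -1.0, h22      # off-diagonal 1 and -1 implied
    if flag == -2.0:
        return 1.0, 0.0, 0.0, 1.0       # identity: vectors are unchanged
    raise ValueError("flag must be -1, 0, 1 or -2")


def rotm(x, y, param):
    """In place: x(i) := h11*x(i) + h12*y(i), y(i) := h21*x(i) + h22*y(i)."""
    h11, h12, h21, h22 = h_from_param(param)
    for i in range(len(x)):
        xi, yi = x[i], y[i]  # both updates use the values on entry
        x[i] = h11 * xi + h12 * yi
        y[i] = h21 * xi + h22 * yi
```

With flag = 0, for instance, the values placed in the h11 and h22 slots of param are ignored, which is why they can be left unset.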
?rotmg
Computes the parameters for a modified Givens rotation.

Syntax
Fortran 77:
call srotmg(d1, d2, x1, y1, param)
call drotmg(d1, d2, x1, y1, param)
Fortran 95:
call rotmg(d1, d2, x1, y1, param)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given Cartesian coordinates (x1, y1) of an input vector, these routines compute the components of a modified Givens transformation matrix H that zeros the y-component of the resulting vector.

Input Parameters
d1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the scaling factor for the x-coordinate of the input vector.
d2      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the scaling factor for the y-coordinate of the input vector.
x1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the x-coordinate of the input vector.
y1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the y-coordinate of the input vector.

Output Parameters
d1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the first diagonal element of the updated matrix.
d2      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the second diagonal element of the updated matrix.
x1      REAL for srotmg
        DOUBLE PRECISION for drotmg
        Provides the x-coordinate of the rotated vector before scaling.
param   REAL for srotmg
        DOUBLE PRECISION for drotmg
        Array, DIMENSION 5.
        The elements of the param array are:
        param(1) contains a switch, flag.
        param(2-5) contain h11, h21, h12, and h22, respectively, the components of the array H.
        Depending on the values of flag, the components of H are set as follows (rows of H are separated by semicolons):
        flag = -1.:  H = [ h11  h12 ;  h21  h22 ]
        flag =  0.:  H = [ 1.   h12 ;  h21  1.  ]
        flag =  1.:  H = [ h11  1.  ;  -1.  h22 ]
        flag = -2.:  H = [ 1.   0.  ;  0.   1.  ]
        In the last three cases, the matrix entries of 1., -1., and 0. are assumed based on the value of flag and are not required to be set in the param vector.

?scal
Computes the product of a vector by a scalar.
Syntax
Fortran 77:
call sscal(n, a, x, incx)
call dscal(n, a, x, incx)
call cscal(n, a, x, incx)
call zscal(n, a, x, incx)
call csscal(n, a, x, incx)
call zdscal(n, a, x, incx)
Fortran 95:
call scal(x, a)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?scal routines perform a vector operation defined as
x = a*x
where:
a is a scalar,
x is an n-element vector.

Input Parameters
n       INTEGER. Specifies the number of elements in vector x.
a       REAL for sscal and csscal
        DOUBLE PRECISION for dscal and zdscal
        COMPLEX for cscal
        DOUBLE COMPLEX for zscal
        Specifies the scalar a.
x       REAL for sscal
        DOUBLE PRECISION for dscal
        COMPLEX for cscal and csscal
        DOUBLE COMPLEX for zscal and zdscal
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.

Output Parameters
x       Updated vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine scal interface are the following:
x       Holds the vector with the number of elements n.

?swap
Swaps a vector with another vector.

Syntax
Fortran 77:
call sswap(n, x, incx, y, incy)
call dswap(n, x, incx, y, incy)
call cswap(n, x, incx, y, incy)
call zswap(n, x, incx, y, incy)
Fortran 95:
call swap(x, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
Given two vectors x and y, the ?swap routines return vectors y and x swapped, each replacing the other.

Input Parameters
n       INTEGER. Specifies the number of elements in vectors x and y.
x       REAL for sswap
        DOUBLE PRECISION for dswap
        COMPLEX for cswap
        DOUBLE COMPLEX for zswap
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.
y       REAL for sswap
        DOUBLE PRECISION for dswap
        COMPLEX for cswap
        DOUBLE COMPLEX for zswap
        Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy    INTEGER. Specifies the increment for the elements of y.

Output Parameters
x       Contains the resultant vector x, that is, the input vector y.
y       Contains the resultant vector y, that is, the input vector x.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine swap interface are the following:
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.

i?amax
Finds the index of the element with maximum absolute value.

Syntax
Fortran 77:
index = isamax(n, x, incx)
index = idamax(n, x, incx)
index = icamax(n, x, incx)
index = izamax(n, x, incx)
Fortran 95:
index = iamax(x)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
This function is declared in mkl_blas.fi for FORTRAN 77 interface, in blas.f90 for Fortran 95 interface, and in mkl_blas.h for C interface.
Given a vector x, the i?amax functions return the position of the vector element x(i) that has the largest absolute value for real flavors, or the largest sum |Re(x(i))|+|Im(x(i))| for complex flavors.
If n is not positive, 0 is returned.
If more than one vector element is found with the same largest absolute value, the index of the first one encountered is returned.

Input Parameters
n       INTEGER. Specifies the number of elements in vector x.
x       REAL for isamax
        DOUBLE PRECISION for idamax
        COMPLEX for icamax
        DOUBLE COMPLEX for izamax
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.

Output Parameters
index   INTEGER.
Contains the position of the element of vector x that has the largest absolute value.

Fortran 95 Interface Notes
Functions and routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the function iamax interface are the following:
x       Holds the vector with the number of elements n.

i?amin
Finds the index of the element with the smallest absolute value.

Syntax
Fortran 77:
index = isamin(n, x, incx)
index = idamin(n, x, incx)
index = icamin(n, x, incx)
index = izamin(n, x, incx)
Fortran 95:
index = iamin(x)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
This function is declared in mkl_blas.fi for FORTRAN 77 interface, in blas.f90 for Fortran 95 interface, and in mkl_blas.h for C interface.
Given a vector x, the i?amin functions return the position of the vector element x(i) that has the smallest absolute value for real flavors, or the smallest sum |Re(x(i))|+|Im(x(i))| for complex flavors.
If n is not positive, 0 is returned.
If more than one vector element is found with the same smallest absolute value, the index of the first one encountered is returned.

Input Parameters
n       INTEGER. On entry, n specifies the number of elements in vector x.
x       REAL for isamin
        DOUBLE PRECISION for idamin
        COMPLEX for icamin
        DOUBLE COMPLEX for izamin
        Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx    INTEGER. Specifies the increment for the elements of x.

Output Parameters
index   INTEGER. Contains the position of the element of vector x that has the smallest absolute value.

Fortran 95 Interface Notes
Functions and routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the function iamin interface are the following:
x       Holds the vector with the number of elements n.

?cabs1
Computes absolute value of complex number.

Syntax
Fortran 77:
res = scabs1(z)
res = dcabs1(z)
Fortran 95:
res = cabs1(z)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?cabs1 is an auxiliary routine for a few BLAS Level 1 routines. This routine performs an operation defined as
res = |Re(z)| + |Im(z)|,
where z is a scalar, and res is a value containing the absolute value of a complex number z.

Input Parameters
z       COMPLEX scalar for scabs1.
        DOUBLE COMPLEX scalar for dcabs1.

Output Parameters
res     REAL for scabs1.
        DOUBLE PRECISION for dcabs1.
        Contains the absolute value of a complex number z.

BLAS Level 2 Routines

This section describes BLAS Level 2 routines, which perform matrix-vector operations. Table “BLAS Level 2 Routine Groups and Their Data Types” lists the BLAS Level 2 routine groups and the data types associated with them.
BLAS Level 2 Routine Groups and Their Data Types Routine Groups Data Types Description ?gbmv s, d, c, z Matrix-vector product using a general band matrix gemv s, d, c, z Matrix-vector product using a general matrix ?ger s, d Rank-1 update of a general matrix ?gerc c, z Rank-1 update of a conjugated general matrix ?geru c, z Rank-1 update of a general matrix, unconjugated ?hbmv c, z Matrix-vector product using a Hermitian band matrix ?hemv c, z Matrix-vector product using a Hermitian matrix ?her c, z Rank-1 update of a Hermitian matrix ?her2 c, z Rank-2 update of a Hermitian matrix ?hpmv c, z Matrix-vector product using a Hermitian packed matrix ?hpr c, z Rank-1 update of a Hermitian packed matrix ?hpr2 c, z Rank-2 update of a Hermitian packed matrix ?sbmv s, d Matrix-vector product using symmetric band matrix ?spmv s, d Matrix-vector product using a symmetric packed matrix ?spr s, d Rank-1 update of a symmetric packed matrix ?spr2 s, d Rank-2 update of a symmetric packed matrix ?symv s, d Matrix-vector product using a symmetric matrix ?syr s, d Rank-1 update of a symmetric matrix ?syr2 s, d Rank-2 update of a symmetric matrix ?tbmv s, d, c, z Matrix-vector product using a triangular band matrix ?tbsv s, d, c, z Solution of a linear system of equations with a triangular band matrix ?tpmv s, d, c, z Matrix-vector product using a triangular packed matrix ?tpsv s, d, c, z Solution of a linear system of equations with a triangular packed matrix ?trmv s, d, c, z Matrix-vector product using a triangular matrix ?trsv s, d, c, z Solution of a linear system of equations with a triangular matrix 2 Intel® Math Kernel Library Reference Manual 74 ?gbmv Computes a matrix-vector product using a general band matrix Syntax Fortran 77: call sgbmv(trans, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy) call dgbmv(trans, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy) call cgbmv(trans, m, n, kl, ku, alpha, a, lda, x, incx, beta, y, incy) call zgbmv(trans, m, n, kl, ku, alpha, 
a, lda, x, incx, beta, y, incy)
Fortran 95:
call gbmv(a, x, y [,kl] [,m] [,alpha] [,beta] [,trans])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?gbmv routines perform a matrix-vector operation defined as
y := alpha*A*x + beta*y,
or
y := alpha*A'*x + beta*y,
or
y := alpha*conjg(A')*x + beta*y,
where:
alpha and beta are scalars,
x and y are vectors,
A is an m-by-n band matrix, with kl sub-diagonals and ku super-diagonals.

Input Parameters
trans   CHARACTER*1. Specifies the operation:
        If trans = 'N' or 'n', then y := alpha*A*x + beta*y
        If trans = 'T' or 't', then y := alpha*A'*x + beta*y
        If trans = 'C' or 'c', then y := alpha*conjg(A')*x + beta*y
m       INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n       INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
kl      INTEGER. Specifies the number of sub-diagonals of the matrix A. The value of kl must satisfy 0 ≤ kl.
ku      INTEGER. Specifies the number of super-diagonals of the matrix A. The value of ku must satisfy 0 ≤ ku.
alpha   REAL for sgbmv; DOUBLE PRECISION for dgbmv; COMPLEX for cgbmv; DOUBLE COMPLEX for zgbmv.
        Specifies the scalar alpha.
a       REAL for sgbmv; DOUBLE PRECISION for dgbmv; COMPLEX for cgbmv; DOUBLE COMPLEX for zgbmv.
        Array, DIMENSION (lda, n). Before entry, the leading (kl + ku + 1) by n part of the array a must contain the matrix of coefficients. This matrix must be supplied column-by-column, with the leading diagonal of the matrix in row (ku + 1) of the array, the first super-diagonal starting at position 2 in row ku, the first sub-diagonal starting at position 1 in row (ku + 2), and so on. Elements in the array a that do not correspond to elements in the band matrix (such as the top left ku by ku triangle) are not referenced.
        The following program segment transfers a band matrix from conventional full matrix storage to band storage:

              do 20, j = 1, n
                 k = ku + 1 - j
                 do 10, i = max(1, j-ku), min(m, j+kl)
                    a(k+i, j) = matrix(i,j)
           10    continue
           20 continue

lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (kl + ku + 1).
x       REAL for sgbmv; DOUBLE PRECISION for dgbmv; COMPLEX for cgbmv; DOUBLE COMPLEX for zgbmv.
        Array, DIMENSION at least (1 + (n - 1)*abs(incx)) when trans = 'N' or 'n', and at least (1 + (m - 1)*abs(incx)) otherwise. Before entry, the incremented array x must contain the vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta    REAL for sgbmv; DOUBLE PRECISION for dgbmv; COMPLEX for cgbmv; DOUBLE COMPLEX for zgbmv.
        Specifies the scalar beta. When beta is equal to zero, then y need not be set on input.
y       REAL for sgbmv; DOUBLE PRECISION for dgbmv; COMPLEX for cgbmv; DOUBLE COMPLEX for zgbmv.
        Array, DIMENSION at least (1 + (m - 1)*abs(incy)) when trans = 'N' or 'n', and at least (1 + (n - 1)*abs(incy)) otherwise. Before entry, the incremented array y must contain the vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y       Updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbmv interface are the following:
a       Holds the array a of size (kl+ku+1, n). Contains an m-by-n band matrix with kl lower diagonals and ku upper diagonals.
x       Holds the vector with the number of elements rx, where rx = n if trans = 'N', rx = m otherwise.
y       Holds the vector with the number of elements ry, where ry = m if trans = 'N', ry = n otherwise.
trans   Must be 'N', 'C', or 'T'. The default value is 'N'.
kl      If omitted, assumed kl = ku, that is, the number of lower diagonals equals the number of the upper diagonals.
ku      Restored as ku = lda-kl-1, where lda is the leading dimension of matrix A.
m       If omitted, assumed m = n, that is, a square matrix.
alpha   The default value is 1.
beta    The default value is 0.

?gemv
Computes a matrix-vector product using a general matrix.

Syntax
Fortran 77:
call sgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
call dgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
call cgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
call zgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
call scgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
call dzgemv(trans, m, n, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
call gemv(a, x, y [,alpha][,beta] [,trans])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?gemv routines perform a matrix-vector operation defined as
y := alpha*A*x + beta*y,
or
y := alpha*A'*x + beta*y,
or
y := alpha*conjg(A')*x + beta*y,
where:
alpha and beta are scalars,
x and y are vectors,
A is an m-by-n matrix.

Input Parameters
trans   CHARACTER*1. Specifies the operation:
        if trans = 'N' or 'n', then y := alpha*A*x + beta*y;
        if trans = 'T' or 't', then y := alpha*A'*x + beta*y;
        if trans = 'C' or 'c', then y := alpha*conjg(A')*x + beta*y.
m       INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n       INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha   REAL for sgemv; DOUBLE PRECISION for dgemv; COMPLEX for cgemv, scgemv; DOUBLE COMPLEX for zgemv, dzgemv.
        Specifies the scalar alpha.
a       REAL for sgemv, scgemv; DOUBLE PRECISION for dgemv, dzgemv; COMPLEX for cgemv; DOUBLE COMPLEX for zgemv.
        Array, DIMENSION (lda, n).
        Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).
x       REAL for sgemv; DOUBLE PRECISION for dgemv; COMPLEX for cgemv, scgemv; DOUBLE COMPLEX for zgemv, dzgemv.
        Array, DIMENSION at least (1 + (n - 1)*abs(incx)) when trans = 'N' or 'n', and at least (1 + (m - 1)*abs(incx)) otherwise. Before entry, the incremented array x must contain the vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta    REAL for sgemv; DOUBLE PRECISION for dgemv; COMPLEX for cgemv, scgemv; DOUBLE COMPLEX for zgemv, dzgemv.
        Specifies the scalar beta. When beta is set to zero, then y need not be set on input.
y       REAL for sgemv; DOUBLE PRECISION for dgemv; COMPLEX for cgemv, scgemv; DOUBLE COMPLEX for zgemv, dzgemv.
        Array, DIMENSION at least (1 + (m - 1)*abs(incy)) when trans = 'N' or 'n', and at least (1 + (n - 1)*abs(incy)) otherwise. Before entry with non-zero beta, the incremented array y must contain the vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y       Updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gemv interface are the following:
a       Holds the matrix A of size (m,n).
x       Holds the vector with the number of elements rx, where rx = n if trans = 'N', rx = m otherwise.
y       Holds the vector with the number of elements ry, where ry = m if trans = 'N', ry = n otherwise.
trans   Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha   The default value is 1.
beta    The default value is 0.
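As an illustration of the ?gemv calling sequence described above, the following is a minimal sketch rather than an official sample. It assumes the program is linked against a library that provides the FORTRAN 77 routine dgemv (Intel MKL or another BLAS). The program name gemv_sketch and the helper vectors u and yt are hypothetical.

```fortran
program gemv_sketch
  implicit none
  integer, parameter :: m = 2, n = 3
  double precision :: a(m, n), x(n), y(m), u(m), yt(n)
  external :: dgemv   ! FORTRAN 77 interface; assumes a BLAS library is linked

  ! Fortran stores arrays column by column, so this fills
  !   A = | 1 2 3 |
  !       | 4 5 6 |
  a = reshape((/ 1d0, 4d0, 2d0, 5d0, 3d0, 6d0 /), (/ m, n /))
  x = (/ 1d0, 1d0, 1d0 /)
  y = (/ 10d0, 20d0 /)
  u = (/ 1d0, 1d0 /)

  ! trans = 'N': y := 2*A*x + 1*y, with lda = m and unit increments
  call dgemv('N', m, n, 2d0, a, m, x, 1, 1d0, y, 1)
  print *, 'y  =', y     ! expected values: 22 and 50

  ! trans = 'T': yt := 1*A'*u + 0*yt; beta is zero, so yt is not set on input
  call dgemv('T', m, n, 1d0, a, m, u, 1, 0d0, yt, 1)
  print *, 'yt =', yt    ! expected values: 5, 7 and 9
end program gemv_sketch
```

Note that with trans = 'T' the input vector has m elements and the output vector has n elements, which matches the rx and ry rules given in the interface notes.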
?ger
Performs a rank-1 update of a general matrix.

Syntax
Fortran 77:
call sger(m, n, alpha, x, incx, y, incy, a, lda)
call dger(m, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
call ger(a, x, y [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?ger routines perform a matrix-vector operation defined as
A := alpha*x*y' + A,
where:
alpha is a scalar,
x is an m-element vector,
y is an n-element vector,
A is an m-by-n general matrix.

Input Parameters
m       INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n       INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha   REAL for sger; DOUBLE PRECISION for dger.
        Specifies the scalar alpha.
x       REAL for sger; DOUBLE PRECISION for dger.
        Array, DIMENSION at least (1 + (m - 1)*abs(incx)). Before entry, the incremented array x must contain the m-element vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y       REAL for sger; DOUBLE PRECISION for dger.
        Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a       REAL for sger; DOUBLE PRECISION for dger.
        Array, DIMENSION (lda, n). Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).

Output Parameters
a       Overwritten by the updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ger interface are the following:
a       Holds the matrix A of size (m,n).
x       Holds the vector with the number of elements m.
y       Holds the vector with the number of elements n.
alpha   The default value is 1.

?gerc
Performs a rank-1 update (conjugated) of a general matrix.

Syntax
Fortran 77:
call cgerc(m, n, alpha, x, incx, y, incy, a, lda)
call zgerc(m, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
call gerc(a, x, y [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?gerc routines perform a matrix-vector operation defined as
A := alpha*x*conjg(y') + A,
where:
alpha is a scalar,
x is an m-element vector,
y is an n-element vector,
A is an m-by-n matrix.

Input Parameters
m       INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n       INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha   COMPLEX for cgerc; DOUBLE COMPLEX for zgerc.
        Specifies the scalar alpha.
x       COMPLEX for cgerc; DOUBLE COMPLEX for zgerc.
        Array, DIMENSION at least (1 + (m - 1)*abs(incx)). Before entry, the incremented array x must contain the m-element vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y       COMPLEX for cgerc; DOUBLE COMPLEX for zgerc.
        Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a       COMPLEX for cgerc; DOUBLE COMPLEX for zgerc.
        Array, DIMENSION (lda, n). Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).
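The conjugated update defined above for ?gerc can be contrasted with the unconjugated variant (?geru) using plain loops. The sketch below is a reference illustration of the two formulas only, not the library implementation, and it does not call Intel MKL. The program name gerc_sketch and the arrays a_c and a_u are hypothetical.

```fortran
program gerc_sketch
  implicit none
  integer, parameter :: m = 2, n = 2
  complex :: a_c(m, n), a_u(m, n), x(m), y(n), alpha
  integer :: i, j

  alpha = (1.0, 0.0)
  x(1) = (1.0, 0.0)          ! x = (1, i)
  x(2) = (0.0, 1.0)
  y(1) = (0.0, 1.0)          ! y = (i, 1)
  y(2) = (1.0, 0.0)
  a_c = (0.0, 0.0)
  a_u = (0.0, 0.0)

  do j = 1, n
     do i = 1, m
        ! conjugated update: A := alpha*x*conjg(y') + A
        a_c(i, j) = a_c(i, j) + alpha * x(i) * conjg(y(j))
        ! unconjugated update: A := alpha*x*y' + A
        a_u(i, j) = a_u(i, j) + alpha * x(i) * y(j)
     end do
  end do

  print *, 'conjugated   A(1,1) =', a_c(1, 1)   ! 1 * conjg(i) = (0,-1)
  print *, 'unconjugated A(1,1) =', a_u(1, 1)   ! 1 * i        = (0, 1)
end program gerc_sketch
```

For real data the two updates coincide, which is consistent with the real routines providing only ?ger.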

Output Parameters
a       Overwritten by the updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gerc interface are the following:
a       Holds the matrix A of size (m,n).
x       Holds the vector with the number of elements m.
y       Holds the vector with the number of elements n.
alpha   The default value is 1.

?geru
Performs a rank-1 update (unconjugated) of a general matrix.

Syntax
Fortran 77:
call cgeru(m, n, alpha, x, incx, y, incy, a, lda)
call zgeru(m, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
call geru(a, x, y [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?geru routines perform a matrix-vector operation defined as
A := alpha*x*y' + A,
where:
alpha is a scalar,
x is an m-element vector,
y is an n-element vector,
A is an m-by-n matrix.

Input Parameters
m       INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n       INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha   COMPLEX for cgeru; DOUBLE COMPLEX for zgeru.
        Specifies the scalar alpha.
x       COMPLEX for cgeru; DOUBLE COMPLEX for zgeru.
        Array, DIMENSION at least (1 + (m - 1)*abs(incx)). Before entry, the incremented array x must contain the m-element vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y       COMPLEX for cgeru; DOUBLE COMPLEX for zgeru.
        Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a       COMPLEX for cgeru; DOUBLE COMPLEX for zgeru.
        Array, DIMENSION (lda, n).
        Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).

Output Parameters
a       Overwritten by the updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geru interface are the following:
a       Holds the matrix A of size (m,n).
x       Holds the vector with the number of elements m.
y       Holds the vector with the number of elements n.
alpha   The default value is 1.

?hbmv
Computes a matrix-vector product using a Hermitian band matrix.

Syntax
Fortran 77:
call chbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
call zhbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
call hbmv(a, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?hbmv routines perform a matrix-vector operation defined as
y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n Hermitian band matrix, with k super-diagonals.

Input Parameters
uplo    CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian band matrix A is used:
        If uplo = 'U' or 'u', then the upper triangular part of the matrix A is used.
        If uplo = 'L' or 'l', then the lower triangular part of the matrix A is used.
n       INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
k       INTEGER. Specifies the number of super-diagonals of the matrix A. The value of k must satisfy 0 ≤ k.
alpha   COMPLEX for chbmv; DOUBLE COMPLEX for zhbmv.
        Specifies the scalar alpha.
a       COMPLEX for chbmv; DOUBLE COMPLEX for zhbmv.
        Array, DIMENSION (lda, n). Before entry with uplo = 'U' or 'u', the leading (k + 1) by n part of the array a must contain the upper triangular band part of the Hermitian matrix. The matrix must be supplied column-by-column, with the leading diagonal of the matrix in row (k + 1) of the array, the first super-diagonal starting at position 2 in row k, and so on. The top left k by k triangle of the array a is not referenced.
        The following program segment transfers the upper triangular part of a Hermitian band matrix from conventional full matrix storage to band storage:

              do 20, j = 1, n
                 m = k + 1 - j
                 do 10, i = max(1, j - k), j
                    a(m + i, j) = matrix(i, j)
           10    continue
           20 continue

        Before entry with uplo = 'L' or 'l', the leading (k + 1) by n part of the array a must contain the lower triangular band part of the Hermitian matrix, supplied column-by-column, with the leading diagonal of the matrix in row 1 of the array, the first sub-diagonal starting at position 1 in row 2, and so on. The bottom right k by k triangle of the array a is not referenced.
        The following program segment transfers the lower triangular part of a Hermitian band matrix from conventional full matrix storage to band storage:

              do 20, j = 1, n
                 m = 1 - j
                 do 10, i = j, min(n, j + k)
                    a(m + i, j) = matrix(i, j)
           10    continue
           20 continue

        The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (k + 1).
x       COMPLEX for chbmv; DOUBLE COMPLEX for zhbmv.
        Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta    COMPLEX for chbmv; DOUBLE COMPLEX for zhbmv.
        Specifies the scalar beta.
y       COMPLEX for chbmv; DOUBLE COMPLEX for zhbmv.
        Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y       Overwritten by the updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hbmv interface are the following:
a       Holds the array a of size (k+1,n).
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.
uplo    Must be 'U' or 'L'. The default value is 'U'.
alpha   The default value is 1.
beta    The default value is 0.

?hemv
Computes a matrix-vector product using a Hermitian matrix.

Syntax
Fortran 77:
call chemv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
call zhemv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
Fortran 95:
call hemv(a, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?hemv routines perform a matrix-vector operation defined as
y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n Hermitian matrix.

Input Parameters
uplo    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
        If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
        If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n       INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha   COMPLEX for chemv; DOUBLE COMPLEX for zhemv.
        Specifies the scalar alpha.
a       COMPLEX for chemv; DOUBLE COMPLEX for zhemv.
        Array, DIMENSION (lda, n).
        Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced.
        Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of a is not referenced.
        The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
x       COMPLEX for chemv; DOUBLE COMPLEX for zhemv.
        Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta    COMPLEX for chemv; DOUBLE COMPLEX for zhemv.
        Specifies the scalar beta. When beta is supplied as zero, then y need not be set on input.
y       COMPLEX for chemv; DOUBLE COMPLEX for zhemv.
        Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y       Overwritten by the updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hemv interface are the following:
a       Holds the matrix A of size (n,n).
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.
uplo    Must be 'U' or 'L'. The default value is 'U'.
alpha   The default value is 1.
beta    The default value is 0.

?her
Performs a rank-1 update of a Hermitian matrix.

Syntax
Fortran 77:
call cher(uplo, n, alpha, x, incx, a, lda)
call zher(uplo, n, alpha, x, incx, a, lda)
Fortran 95:
call her(a, x [,uplo] [, alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?her routines perform a matrix-vector operation defined as
A := alpha*x*conjg(x') + A,
where:
alpha is a real scalar,
x is an n-element vector,
A is an n-by-n Hermitian matrix.

Input Parameters
uplo    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
        If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
        If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n       INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha   REAL for cher; DOUBLE PRECISION for zher.
        Specifies the scalar alpha.
x       COMPLEX for cher; DOUBLE COMPLEX for zher.
        Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
a       COMPLEX for cher; DOUBLE COMPLEX for zher.
        Array, DIMENSION (lda, n).
        Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced.
        Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of a is not referenced.
        The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
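To make the role of uplo in the ?her update concrete, the following plain-loop sketch applies A := alpha*x*conjg(x') + A to the upper triangle only, as for uplo = 'U'. It is a reference illustration under stated assumptions, not the library implementation, and it does not call Intel MKL. The program name her_sketch and the sentinel value stored in the lower triangle are hypothetical.

```fortran
program her_sketch
  implicit none
  integer, parameter :: n = 2
  complex :: a(n, n), x(n)
  real :: alpha
  integer :: i, j

  alpha = 1.0                 ! alpha is real for the Hermitian rank-1 update
  x(1) = (1.0, 1.0)
  x(2) = (2.0, 0.0)

  a = (-99.0, -99.0)          ! sentinel marks entries that must stay untouched
  a(1, 1) = (0.0, 0.0)        ! upper triangle starts from zero
  a(1, 2) = (0.0, 0.0)
  a(2, 2) = (0.0, 0.0)

  do j = 1, n
     ! strictly upper part of column j
     do i = 1, j - 1
        a(i, j) = a(i, j) + alpha * x(i) * conjg(x(j))
     end do
     ! diagonal entry: the result is real, imaginary part set to zero
     a(j, j) = cmplx(real(a(j, j)) + alpha * real(x(j) * conjg(x(j))), 0.0)
  end do

  print *, 'A(1,1) =', a(1, 1)   ! (2,0)
  print *, 'A(1,2) =', a(1, 2)   ! (2,2)
  print *, 'A(2,2) =', a(2, 2)   ! (4,0)
  print *, 'A(2,1) =', a(2, 1)   ! still the sentinel (-99,-99)
end program her_sketch
```

The untouched sentinel in A(2,1) mirrors the statement that the strictly lower triangular part of a is not referenced when uplo = 'U'.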

Output Parameters
a       With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix.
        With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix.
        The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine her interface are the following:
a       Holds the matrix A of size (n,n).
x       Holds the vector with the number of elements n.
uplo    Must be 'U' or 'L'. The default value is 'U'.
alpha   The default value is 1.

?her2
Performs a rank-2 update of a Hermitian matrix.

Syntax
Fortran 77:
call cher2(uplo, n, alpha, x, incx, y, incy, a, lda)
call zher2(uplo, n, alpha, x, incx, y, incy, a, lda)
Fortran 95:
call her2(a, x, y [,uplo][,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?her2 routines perform a matrix-vector operation defined as
A := alpha*x*conjg(y') + conjg(alpha)*y*conjg(x') + A,
where:
alpha is a scalar,
x and y are n-element vectors,
A is an n-by-n Hermitian matrix.

Input Parameters
uplo    CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used.
        If uplo = 'U' or 'u', then the upper triangular part of the array a is used.
        If uplo = 'L' or 'l', then the lower triangular part of the array a is used.
n       INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha   COMPLEX for cher2; DOUBLE COMPLEX for zher2.
        Specifies the scalar alpha.
x       COMPLEX for cher2; DOUBLE COMPLEX for zher2.
        Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
        Before entry, the incremented array x must contain the n-element vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y       COMPLEX for cher2; DOUBLE COMPLEX for zher2.
        Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.
a       COMPLEX for cher2; DOUBLE COMPLEX for zher2.
        Array, DIMENSION (lda, n).
        Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced.
        Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of a is not referenced.
        The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
lda     INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).

Output Parameters
a       With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix.
        With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix.
        The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine her2 interface are the following:
a       Holds the matrix A of size (n,n).
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.
uplo    Must be 'U' or 'L'. The default value is 'U'.
alpha   The default value is 1.

?hpmv
Computes a matrix-vector product using a Hermitian packed matrix.

Syntax
Fortran 77:
call chpmv(uplo, n, alpha, ap, x, incx, beta, y, incy)
call zhpmv(uplo, n, alpha, ap, x, incx, beta, y, incy)
Fortran 95:
call hpmv(ap, x, y [,uplo][,alpha] [,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?hpmv routines perform a matrix-vector operation defined as
y := alpha*A*x + beta*y,
where:
alpha and beta are scalars,
x and y are n-element vectors,
A is an n-by-n Hermitian matrix, supplied in packed form.

Input Parameters
uplo    CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap.
        If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap.
        If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap.
n       INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha   COMPLEX for chpmv; DOUBLE COMPLEX for zhpmv.
        Specifies the scalar alpha.
ap      COMPLEX for chpmv; DOUBLE COMPLEX for zhpmv.
        Array, DIMENSION at least ((n*(n + 1))/2).
        Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(1, 2) and a(2, 2) respectively, and so on.
        Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(2, 1) and a(3, 1) respectively, and so on.
        The imaginary parts of the diagonal elements need not be set and are assumed to be zero.
x       COMPLEX for chpmv; DOUBLE COMPLEX for zhpmv.
        Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
beta    COMPLEX for chpmv; DOUBLE COMPLEX for zhpmv.
        Specifies the scalar beta. When beta is equal to zero, then y need not be set on input.
y       COMPLEX for chpmv; DOUBLE COMPLEX for zhpmv.
        Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y.
incy    INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero.

Output Parameters
y       Overwritten by the updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hpmv interface are the following:
ap      Holds the array ap of size (n*(n+1)/2).
x       Holds the vector with the number of elements n.
y       Holds the vector with the number of elements n.
uplo    Must be 'U' or 'L'. The default value is 'U'.
alpha   The default value is 1.
beta    The default value is 0.

?hpr
Performs a rank-1 update of a Hermitian packed matrix.

Syntax
Fortran 77:
call chpr(uplo, n, alpha, x, incx, ap)
call zhpr(uplo, n, alpha, x, incx, ap)
Fortran 95:
call hpr(ap, x [,uplo] [, alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?hpr routines perform a matrix-vector operation defined as
A := alpha*x*conjg(x') + A,
where:
alpha is a real scalar,
x is an n-element vector,
A is an n-by-n Hermitian matrix, supplied in packed form.

Input Parameters
uplo    CHARACTER*1.
        Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap.
        If uplo = 'U' or 'u', the upper triangular part of the matrix A is supplied in the packed array ap.
        If uplo = 'L' or 'l', the lower triangular part of the matrix A is supplied in the packed array ap.
n       INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
alpha   REAL for chpr; DOUBLE PRECISION for zhpr.
        Specifies the scalar alpha.
x       COMPLEX for chpr; DOUBLE COMPLEX for zhpr.
        Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx    INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
ap      COMPLEX for chpr; DOUBLE COMPLEX for zhpr.
        Array, DIMENSION at least ((n*(n + 1))/2).
        Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(1, 2) and a(2, 2) respectively, and so on.
        Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(2, 1) and a(3, 1) respectively, and so on.
        The imaginary parts of the diagonal elements need not be set and are assumed to be zero.

Output Parameters
ap      With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix.
        With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix.
        The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hpr interface are the following: ap Holds the array ap of size (n*(n+1)/2). x Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1.

?hpr2 Performs a rank-2 update of a Hermitian packed matrix.

Syntax Fortran 77: call chpr2(uplo, n, alpha, x, incx, y, incy, ap) call zhpr2(uplo, n, alpha, x, incx, y, incy, ap) Fortran 95: call hpr2(ap, x, y [,uplo] [,alpha]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?hpr2 routines perform a matrix-vector operation defined as A := alpha*x*conjg(y') + conjg(alpha)*y*conjg(x') + A, where: alpha is a scalar, x and y are n-element vectors, A is an n-by-n Hermitian matrix, supplied in packed form. Input Parameters uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap. If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap. If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. alpha COMPLEX for chpr2 DOUBLE COMPLEX for zhpr2 Specifies the scalar alpha. x COMPLEX for chpr2 DOUBLE COMPLEX for zhpr2 Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. y COMPLEX for chpr2 DOUBLE COMPLEX for zhpr2 Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y. incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero. ap COMPLEX for chpr2 DOUBLE COMPLEX for zhpr2 Array, DIMENSION at least ((n*(n + 1))/2).
Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on. Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the Hermitian matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on. The imaginary parts of the diagonal elements need not be set and are assumed to be zero. Output Parameters ap With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix. The imaginary parts of the diagonal elements are set to zero. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hpr2 interface are the following: ap Holds the array ap of size (n*(n+1)/2). x Holds the vector with the number of elements n. y Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1.

?sbmv Computes a matrix-vector product using a symmetric band matrix.
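As a rough illustration of what ?sbmv computes, the C sketch below applies y := alpha*A*x + beta*y to a symmetric band matrix held in upper band storage (uplo = 'U') with unit increments. The name ref_sbmv_upper is hypothetical and is not part of Intel MKL; the loop is a plain reference formulation, not the library's implementation. C indices are 0-based, so element A(i, j) with max(0, j - k) <= i <= j is assumed to sit in row (k + i - j) of column j of the band array. Zeroing y when beta is zero is likewise an assumption, borrowed from the usual BLAS convention.

```c
#include <stddef.h>

/* Hypothetical reference loop (not an MKL routine):
   y := alpha*A*x + beta*y, A symmetric with k super-diagonals,
   upper band storage, column-major with leading dimension lda >= k + 1. */
void ref_sbmv_upper(int n, int k, double alpha, const double *ab, int lda,
                    const double *x, double beta, double *y)
{
    /* Scale y first; with beta == 0 the input values of y are ignored. */
    for (int i = 0; i < n; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    for (int j = 0; j < n; ++j) {
        int i0 = (j - k > 0) ? j - k : 0;      /* first stored row of column j */
        for (int i = i0; i <= j; ++i) {
            double aij = ab[(k + i - j) + (size_t)j * lda];
            y[i] += alpha * aij * x[j];
            if (i != j)
                y[j] += alpha * aij * x[i];    /* mirrored element A(j, i) */
        }
    }
}
```

Only the stored upper band is read; each off-diagonal element is used twice, once for A(i, j) and once for its mirror image A(j, i).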
Syntax Fortran 77: call ssbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy) call dsbmv(uplo, n, k, alpha, a, lda, x, incx, beta, y, incy) Fortran 95: call sbmv(a, x, y [,uplo] [,alpha] [,beta]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?sbmv routines perform a matrix-vector operation defined as y := alpha*A*x + beta*y, where: alpha and beta are scalars, x and y are n-element vectors, A is an n-by-n symmetric band matrix, with k super-diagonals. Input Parameters uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the band matrix A is used: if uplo = 'U' or 'u', the upper triangular part is used; if uplo = 'L' or 'l', the lower triangular part is used. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. k INTEGER. Specifies the number of super-diagonals of the matrix A. The value of k must satisfy 0 ≤ k. alpha REAL for ssbmv DOUBLE PRECISION for dsbmv Specifies the scalar alpha. a REAL for ssbmv DOUBLE PRECISION for dsbmv Array, DIMENSION (lda, n). Before entry with uplo = 'U' or 'u', the leading (k + 1) by n part of the array a must contain the upper triangular band part of the symmetric matrix, supplied column-by-column, with the leading diagonal of the matrix in row (k + 1) of the array, the first super-diagonal starting at position 2 in row k, and so on. The top left k by k triangle of the array a is not referenced.
The following program segment transfers the upper triangular part of a symmetric band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = k + 1 - j
         do 10, i = max( 1, j - k ), j
            a( m + i, j ) = matrix( i, j )
   10    continue
   20 continue

Before entry with uplo = 'L' or 'l', the leading (k + 1) by n part of the array a must contain the lower triangular band part of the symmetric matrix, supplied column-by-column, with the leading diagonal of the matrix in row 1 of the array, the first sub-diagonal starting at position 1 in row 2, and so on. The bottom right k by k triangle of the array a is not referenced. The following program segment transfers the lower triangular part of a symmetric band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = 1 - j
         do 10, i = j, min( n, j + k )
            a( m + i, j ) = matrix( i, j )
   10    continue
   20 continue

lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (k + 1). x REAL for ssbmv DOUBLE PRECISION for dsbmv Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. beta REAL for ssbmv DOUBLE PRECISION for dsbmv Specifies the scalar beta. y REAL for ssbmv DOUBLE PRECISION for dsbmv Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the vector y. incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero. Output Parameters y Overwritten by the updated vector y. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sbmv interface are the following: a Holds the array a of size (k+1,n). x Holds the vector with the number of elements n. y Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1. beta The default value is 0.

?spmv Computes a matrix-vector product using a symmetric packed matrix.

Syntax Fortran 77: call sspmv(uplo, n, alpha, ap, x, incx, beta, y, incy) call dspmv(uplo, n, alpha, ap, x, incx, beta, y, incy) Fortran 95: call spmv(ap, x, y [,uplo] [,alpha] [,beta]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?spmv routines perform a matrix-vector operation defined as y := alpha*A*x + beta*y, where: alpha and beta are scalars, x and y are n-element vectors, A is an n-by-n symmetric matrix, supplied in packed form. Input Parameters uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap. If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap. If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. alpha REAL for sspmv DOUBLE PRECISION for dspmv Specifies the scalar alpha. ap REAL for sspmv DOUBLE PRECISION for dspmv Array, DIMENSION at least ((n*(n + 1))/2). Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on. x REAL for sspmv DOUBLE PRECISION for dspmv Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. beta REAL for sspmv DOUBLE PRECISION for dspmv Specifies the scalar beta. When beta is supplied as zero, y need not be set on input. y REAL for sspmv DOUBLE PRECISION for dspmv Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y. incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero. Output Parameters y Overwritten by the updated vector y. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine spmv interface are the following: ap Holds the array ap of size (n*(n+1)/2). x Holds the vector with the number of elements n. y Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1. beta The default value is 0.

?spr Performs a rank-1 update of a symmetric packed matrix.
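To make the packed layout used by ?spr concrete, here is a minimal C sketch of the update A := alpha*x*x' + A carried out directly on the packed array, with unit increment. The names packed_index and ref_spr are hypothetical helpers, not Intel MKL functions, and the loop is only a reference formulation. The index formulas are the 0-based equivalents of the column-by-column packing described for ap: for 'U', A(i, j) with i <= j lands at i + j*(j + 1)/2; for 'L', A(i, j) with i >= j lands at i + j*(2n - j - 1)/2.

```c
#include <stddef.h>

/* Hypothetical helper: 0-based position of A(i, j) inside the packed array.
   For uplo 'U' the caller must pass i <= j; for 'L' it must pass i >= j. */
static size_t packed_index(char uplo, int n, int i, int j)
{
    if (uplo == 'U')
        return (size_t)i + (size_t)j * (size_t)(j + 1) / 2;
    return (size_t)i + (size_t)j * (size_t)(2 * n - j - 1) / 2;
}

/* Hypothetical reference loop (not an MKL routine):
   A := alpha*x*x' + A on the stored triangle only. */
void ref_spr(char uplo, int n, double alpha, const double *x, double *ap)
{
    for (int j = 0; j < n; ++j) {
        int first = (uplo == 'U') ? 0 : j;       /* rows stored for column j */
        int last  = (uplo == 'U') ? j : n - 1;
        for (int i = first; i <= last; ++i)
            ap[packed_index(uplo, n, i, j)] += alpha * x[i] * x[j];
    }
}
```

Because the matrix is symmetric, updating the stored triangle is sufficient; the other triangle is never materialized.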
Syntax Fortran 77: call sspr(uplo, n, alpha, x, incx, ap) call dspr(uplo, n, alpha, x, incx, ap) Fortran 95: call spr(ap, x [,uplo] [,alpha]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?spr routines perform a matrix-vector operation defined as A := alpha*x*x' + A, where: alpha is a real scalar, x is an n-element vector, A is an n-by-n symmetric matrix, supplied in packed form. Input Parameters uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap. If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap. If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. alpha REAL for sspr DOUBLE PRECISION for dspr Specifies the scalar alpha. x REAL for sspr DOUBLE PRECISION for dspr Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. ap REAL for sspr DOUBLE PRECISION for dspr Array, DIMENSION at least ((n*(n + 1))/2). Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on. Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on. Output Parameters ap With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix.
With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine spr interface are the following: ap Holds the array ap of size (n*(n+1)/2). x Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1.

?spr2 Performs a rank-2 update of a symmetric packed matrix.

Syntax Fortran 77: call sspr2(uplo, n, alpha, x, incx, y, incy, ap) call dspr2(uplo, n, alpha, x, incx, y, incy, ap) Fortran 95: call spr2(ap, x, y [,uplo] [,alpha]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?spr2 routines perform a matrix-vector operation defined as A := alpha*x*y' + alpha*y*x' + A, where: alpha is a scalar, x and y are n-element vectors, A is an n-by-n symmetric matrix, supplied in packed form. Input Parameters uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the matrix A is supplied in the packed array ap. If uplo = 'U' or 'u', then the upper triangular part of the matrix A is supplied in the packed array ap. If uplo = 'L' or 'l', then the lower triangular part of the matrix A is supplied in the packed array ap. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. alpha REAL for sspr2 DOUBLE PRECISION for dspr2 Specifies the scalar alpha. x REAL for sspr2 DOUBLE PRECISION for dspr2 Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.
y REAL for sspr2 DOUBLE PRECISION for dspr2 Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y. incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero. ap REAL for sspr2 DOUBLE PRECISION for dspr2 Array, DIMENSION at least ((n*(n + 1))/2). Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on. Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular part of the symmetric matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on. Output Parameters ap With uplo = 'U' or 'u', overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', overwritten by the lower triangular part of the updated matrix. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine spr2 interface are the following: ap Holds the array ap of size (n*(n+1)/2). x Holds the vector with the number of elements n. y Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1.

?symv Computes a matrix-vector product for a symmetric matrix.
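The following C sketch shows one way to form the ?symv product y := alpha*A*x + beta*y while reading only the upper triangle of a full, column-major array (uplo = 'U'), with unit increments. The name ref_symv_upper is hypothetical and not part of Intel MKL, and the loop is a reference formulation rather than the library's algorithm. It is meant to illustrate two points: the strictly lower triangle is never touched, and with beta equal to zero the incoming contents of y are ignored.

```c
#include <stddef.h>

/* Hypothetical reference loop (not an MKL routine):
   y := alpha*A*x + beta*y, A symmetric, upper triangle stored in a
   column-major array with leading dimension lda >= max(1, n). */
void ref_symv_upper(int n, double alpha, const double *a, int lda,
                    const double *x, double beta, double *y)
{
    /* With beta == 0 the input values of y are not used at all. */
    for (int i = 0; i < n; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    for (int j = 0; j < n; ++j) {
        const double *col = a + (size_t)j * lda;
        for (int i = 0; i < j; ++i) {
            y[i] += alpha * col[i] * x[j];   /* A(i, j), stored          */
            y[j] += alpha * col[i] * x[i];   /* A(j, i), by symmetry     */
        }
        y[j] += alpha * col[j] * x[j];       /* diagonal element A(j, j) */
    }
}
```

In the test values used to check this sketch, the unused lower-triangle slot holds a deliberately wrong number to confirm it has no effect on the result.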
Syntax Fortran 77: call ssymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy) call dsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy) Fortran 95: call symv(a, x, y [,uplo] [,alpha] [,beta]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?symv routines perform a matrix-vector operation defined as y := alpha*A*x + beta*y, where: alpha and beta are scalars, x and y are n-element vectors, A is an n-by-n symmetric matrix. Input Parameters uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used. If uplo = 'U' or 'u', then the upper triangular part of the array a is used. If uplo = 'L' or 'l', then the lower triangular part of the array a is used. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. alpha REAL for ssymv DOUBLE PRECISION for dsymv Specifies the scalar alpha. a REAL for ssymv DOUBLE PRECISION for dsymv Array, DIMENSION (lda, n). Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix A and the strictly lower triangular part of a is not referenced. Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix A and the strictly upper triangular part of a is not referenced. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n). x REAL for ssymv DOUBLE PRECISION for dsymv Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. beta REAL for ssymv DOUBLE PRECISION for dsymv Specifies the scalar beta.
When beta is supplied as zero, y need not be set on input. y REAL for ssymv DOUBLE PRECISION for dsymv Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y. incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero. Output Parameters y Overwritten by the updated vector y. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine symv interface are the following: a Holds the matrix A of size (n,n). x Holds the vector with the number of elements n. y Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1. beta The default value is 0.

?syr Performs a rank-1 update of a symmetric matrix.

Syntax Fortran 77: call ssyr(uplo, n, alpha, x, incx, a, lda) call dsyr(uplo, n, alpha, x, incx, a, lda) Fortran 95: call syr(a, x [,uplo] [,alpha]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?syr routines perform a matrix-vector operation defined as A := alpha*x*x' + A, where: alpha is a real scalar, x is an n-element vector, A is an n-by-n symmetric matrix. Input Parameters uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used. If uplo = 'U' or 'u', then the upper triangular part of the array a is used. If uplo = 'L' or 'l', then the lower triangular part of the array a is used. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. alpha REAL for ssyr DOUBLE PRECISION for dsyr Specifies the scalar alpha. x REAL for ssyr DOUBLE PRECISION for dsyr Array, DIMENSION at least (1 + (n - 1)*abs(incx)).
Before entry, the incremented array x must contain the n-element vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. a REAL for ssyr DOUBLE PRECISION for dsyr Array, DIMENSION (lda, n). Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix A and the strictly lower triangular part of a is not referenced. Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix A and the strictly upper triangular part of a is not referenced. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n). Output Parameters a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine syr interface are the following: a Holds the matrix A of size (n,n). x Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1.

?syr2 Performs a rank-2 update of a symmetric matrix.
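The sketch below illustrates the ?syr2 update A := alpha*x*y' + alpha*y*x' + A on the upper triangle of a column-major array, and also shows how the increments incx and incy select vector elements from a longer buffer. The names vec_start and ref_syr2_upper are hypothetical and are not Intel MKL functions. The handling of negative increments (the logical first element sits at the far end of the buffer) follows the reference BLAS convention and should be read as an assumption, since this section only states that an increment must not be zero.

```c
#include <stddef.h>

/* Hypothetical helper: offset of logical element 0 for a given increment.
   Assumes the reference BLAS convention for negative increments. */
static size_t vec_start(int n, int inc)
{
    return inc > 0 ? 0 : (size_t)(n - 1) * (size_t)(-inc);
}

/* Hypothetical reference loop (not an MKL routine):
   A := alpha*x*y' + alpha*y*x' + A, upper triangle only,
   column-major storage with leading dimension lda >= max(1, n). */
void ref_syr2_upper(int n, double alpha, const double *x, int incx,
                    const double *y, int incy, double *a, int lda)
{
    if (n <= 0)
        return;
    const double *x0 = x + vec_start(n, incx);
    const double *y0 = y + vec_start(n, incy);

    for (int j = 0; j < n; ++j) {
        double xj = x0[(ptrdiff_t)j * incx];
        double yj = y0[(ptrdiff_t)j * incy];
        for (int i = 0; i <= j; ++i) {
            double xi = x0[(ptrdiff_t)i * incx];
            double yi = y0[(ptrdiff_t)i * incy];
            a[i + (size_t)j * lda] += alpha * (xi * yj + yi * xj);
        }
    }
}
```

With incx = 2 the routine reads every second entry of the x buffer, which is why the buffer must hold at least 1 + (n - 1)*abs(incx) elements.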
Syntax Fortran 77: call ssyr2(uplo, n, alpha, x, incx, y, incy, a, lda) call dsyr2(uplo, n, alpha, x, incx, y, incy, a, lda) Fortran 95: call syr2(a, x, y [,uplo] [,alpha]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?syr2 routines perform a matrix-vector operation defined as A := alpha*x*y' + alpha*y*x' + A, where: alpha is a scalar, x and y are n-element vectors, A is an n-by-n symmetric matrix. Input Parameters uplo CHARACTER*1. Specifies whether the upper or lower triangular part of the array a is used. If uplo = 'U' or 'u', then the upper triangular part of the array a is used. If uplo = 'L' or 'l', then the lower triangular part of the array a is used. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. alpha REAL for ssyr2 DOUBLE PRECISION for dsyr2 Specifies the scalar alpha. x REAL for ssyr2 DOUBLE PRECISION for dsyr2 Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. y REAL for ssyr2 DOUBLE PRECISION for dsyr2 Array, DIMENSION at least (1 + (n - 1)*abs(incy)). Before entry, the incremented array y must contain the n-element vector y. incy INTEGER. Specifies the increment for the elements of y. The value of incy must not be zero. a REAL for ssyr2 DOUBLE PRECISION for dsyr2 Array, DIMENSION (lda, n). Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of a is not referenced. Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of a is not referenced. lda INTEGER.
Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n). Output Parameters a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated matrix. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine syr2 interface are the following: a Holds the matrix A of size (n,n). x Holds the vector x of length n. y Holds the vector y of length n. uplo Must be 'U' or 'L'. The default value is 'U'. alpha The default value is 1.

?tbmv Computes a matrix-vector product using a triangular band matrix.

Syntax Fortran 77: call stbmv(uplo, trans, diag, n, k, a, lda, x, incx) call dtbmv(uplo, trans, diag, n, k, a, lda, x, incx) call ctbmv(uplo, trans, diag, n, k, a, lda, x, incx) call ztbmv(uplo, trans, diag, n, k, a, lda, x, incx) Fortran 95: call tbmv(a, x [,uplo] [,trans] [,diag]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?tbmv routines perform one of the matrix-vector operations defined as x := A*x, or x := A'*x, or x := conjg(A')*x, where: x is an n-element vector, A is an n-by-n unit, or non-unit, upper or lower triangular band matrix, with (k + 1) diagonals. Input Parameters uplo CHARACTER*1. Specifies whether the matrix A is an upper or lower triangular matrix: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular. trans CHARACTER*1.
Specifies the operation: if trans = 'N' or 'n', then x := A*x; if trans = 'T' or 't', then x := A'*x; if trans = 'C' or 'c', then x := conjg(A')*x. diag CHARACTER*1. Specifies whether the matrix A is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. k INTEGER. On entry with uplo = 'U' or 'u', k specifies the number of super-diagonals of the matrix A. On entry with uplo = 'L' or 'l', k specifies the number of sub-diagonals of the matrix A. The value of k must satisfy 0 ≤ k. a REAL for stbmv DOUBLE PRECISION for dtbmv COMPLEX for ctbmv DOUBLE COMPLEX for ztbmv Array, DIMENSION (lda, n). Before entry with uplo = 'U' or 'u', the leading (k + 1) by n part of the array a must contain the upper triangular band part of the matrix of coefficients, supplied column-by-column, with the leading diagonal of the matrix in row (k + 1) of the array, the first super-diagonal starting at position 2 in row k, and so on. The top left k by k triangle of the array a is not referenced. The following program segment transfers an upper triangular band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = k + 1 - j
         do 10, i = max(1, j - k), j
            a(m + i, j) = matrix(i, j)
   10    continue
   20 continue

Before entry with uplo = 'L' or 'l', the leading (k + 1) by n part of the array a must contain the lower triangular band part of the matrix of coefficients, supplied column-by-column, with the leading diagonal of the matrix in row 1 of the array, the first sub-diagonal starting at position 1 in row 2, and so on. The bottom right k by k triangle of the array a is not referenced.
The following program segment transfers a lower triangular band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = 1 - j
         do 10, i = j, min(n, j + k)
            a(m + i, j) = matrix(i, j)
   10    continue
   20 continue

Note that when diag = 'U' or 'u', the elements of the array a corresponding to the diagonal elements of the matrix are not referenced, but are assumed to be unity. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (k + 1). x REAL for stbmv DOUBLE PRECISION for dtbmv COMPLEX for ctbmv DOUBLE COMPLEX for ztbmv Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. Output Parameters x Overwritten with the transformed vector x. Fortran 95 Interface Notes Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine tbmv interface are the following: a Holds the array a of size (k+1,n). x Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. trans Must be 'N', 'C', or 'T'. The default value is 'N'. diag Must be 'N' or 'U'. The default value is 'N'.

?tbsv Solves a system of linear equations whose coefficients are in a triangular band matrix.
Syntax Fortran 77: call stbsv(uplo, trans, diag, n, k, a, lda, x, incx) call dtbsv(uplo, trans, diag, n, k, a, lda, x, incx) call ctbsv(uplo, trans, diag, n, k, a, lda, x, incx) call ztbsv(uplo, trans, diag, n, k, a, lda, x, incx) Fortran 95: call tbsv(a, x [,uplo] [,trans] [,diag]) Include Files • FORTRAN 77: mkl_blas.fi • Fortran 95: blas.f90 • C: mkl_blas.h Description The ?tbsv routines solve one of the following systems of equations: A*x = b, or A'*x = b, or conjg(A')*x = b, where: b and x are n-element vectors, A is an n-by-n unit, or non-unit, upper or lower triangular band matrix, with (k + 1) diagonals. The routine does not test for singularity or near-singularity. Such tests must be performed before calling this routine. Input Parameters uplo CHARACTER*1. Specifies whether the matrix A is an upper or lower triangular matrix: if uplo = 'U' or 'u', the matrix is upper triangular; if uplo = 'L' or 'l', the matrix is lower triangular. trans CHARACTER*1. Specifies the system of equations: if trans = 'N' or 'n', then A*x = b; if trans = 'T' or 't', then A'*x = b; if trans = 'C' or 'c', then conjg(A')*x = b. diag CHARACTER*1. Specifies whether the matrix A is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular. n INTEGER. Specifies the order of the matrix A. The value of n must be at least zero. k INTEGER. On entry with uplo = 'U' or 'u', k specifies the number of super-diagonals of the matrix A. On entry with uplo = 'L' or 'l', k specifies the number of sub-diagonals of the matrix A. The value of k must satisfy 0 ≤ k. a REAL for stbsv DOUBLE PRECISION for dtbsv COMPLEX for ctbsv DOUBLE COMPLEX for ztbsv Array, DIMENSION (lda, n).
Before entry with uplo = 'U' or 'u', the leading (k + 1) by n part of the array a must contain the upper triangular band part of the matrix of coefficients, supplied column-by-column, with the leading diagonal of the matrix in row (k + 1) of the array, the first super-diagonal starting at position 2 in row k, and so on. The top left k by k triangle of the array a is not referenced. The following program segment transfers an upper triangular band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = k + 1 - j
         do 10, i = max(1, j - k), j
            a(m + i, j) = matrix(i, j)
   10    continue
   20 continue

Before entry with uplo = 'L' or 'l', the leading (k + 1) by n part of the array a must contain the lower triangular band part of the matrix of coefficients, supplied column-by-column, with the leading diagonal of the matrix in row 1 of the array, the first sub-diagonal starting at position 1 in row 2, and so on. The bottom right k by k triangle of the array a is not referenced. The following program segment transfers a lower triangular band matrix from conventional full matrix storage to band storage:

      do 20, j = 1, n
         m = 1 - j
         do 10, i = j, min(n, j + k)
            a(m + i, j) = matrix(i, j)
   10    continue
   20 continue

When diag = 'U' or 'u', the elements of the array a corresponding to the diagonal elements of the matrix are not referenced, but are assumed to be unity. lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least (k + 1). x REAL for stbsv DOUBLE PRECISION for dtbsv COMPLEX for ctbsv DOUBLE COMPLEX for ztbsv Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element right-hand side vector b. incx INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero. Output Parameters x Overwritten with the solution vector x.
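To show how the band layout and the in-place overwrite of x fit together, here is a minimal C sketch of the ?tbsv case uplo = 'U', trans = 'N', with unit increment: it solves A*x = b by back substitution, reading A from upper band storage. The name ref_tbsv_upper_notrans and the unit_diag flag are hypothetical and not part of Intel MKL, and the loop is a reference formulation, not the library's code. As with the routine described above, the sketch performs no singularity check, so a zero diagonal entry leads to a division by zero.

```c
#include <stddef.h>

/* Hypothetical reference loop (not an MKL routine):
   solves A*x = b in place, A upper triangular with k super-diagonals,
   band storage, column-major with leading dimension lda >= k + 1.
   On entry x holds b; on exit x holds the solution. */
void ref_tbsv_upper_notrans(int unit_diag, int n, int k,
                            const double *ab, int lda, double *x)
{
    for (int j = n - 1; j >= 0; --j) {
        const double *col = ab + (size_t)j * lda;
        if (!unit_diag)
            x[j] /= col[k];                    /* diagonal A(j, j) is in band row k */
        int i0 = (j - k > 0) ? j - k : 0;      /* first stored row of column j */
        for (int i = i0; i < j; ++i)
            x[i] -= x[j] * col[k + i - j];     /* eliminate x[j] from row i */
    }
}
```

When unit_diag is nonzero the stored diagonal is never read, matching the statement that those elements are assumed to be unity.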
Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tbsv interface are the following:
a  Holds the array a of size (k+1,n).
x  Holds the vector with the number of elements n.
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.
diag  Must be 'N' or 'U'. The default value is 'N'.

?tpmv
Computes a matrix-vector product using a triangular packed matrix.

Syntax

Fortran 77:
call stpmv(uplo, trans, diag, n, ap, x, incx)
call dtpmv(uplo, trans, diag, n, ap, x, incx)
call ctpmv(uplo, trans, diag, n, ap, x, incx)
call ztpmv(uplo, trans, diag, n, ap, x, incx)

Fortran 95:
call tpmv(ap, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?tpmv routines perform one of the matrix-vector operations defined as
x := A*x, or x := A'*x, or x := conjg(A')*x,
where:
x is an n-element vector,
A is an n-by-n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.

Input Parameters

uplo  CHARACTER*1. Specifies whether the matrix A is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans  CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then x := A*x; if trans = 'T' or 't', then x := A'*x; if trans = 'C' or 'c', then x := conjg(A')*x.
diag  CHARACTER*1. Specifies whether the matrix A is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
n  INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
ap  REAL for stpmv; DOUBLE PRECISION for dtpmv; COMPLEX for ctpmv; DOUBLE COMPLEX for ztpmv. Array, DIMENSION at least ((n*(n + 1))/2).
Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(1,2) and a(2,2) respectively, and so on.
Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular matrix packed sequentially, column-by-column, so that ap(1) contains a(1,1), ap(2) and ap(3) contain a(2,1) and a(3,1) respectively, and so on.
When diag = 'U' or 'u', the diagonal elements of a are not referenced, but are assumed to be unity.
x  REAL for stpmv; DOUBLE PRECISION for dtpmv; COMPLEX for ctpmv; DOUBLE COMPLEX for ztpmv. Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx  INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters

x  Overwritten with the transformed vector x.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tpmv interface are the following:
ap  Holds the array ap of size (n*(n+1)/2).
x  Holds the vector with the number of elements n.
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.
diag  Must be 'N' or 'U'. The default value is 'N'.

?tpsv
Solves a system of linear equations whose coefficients are in a triangular packed matrix.
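The packed column-by-column layout that ?tpmv expects for ap (and that ?tpsv uses as well) can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not the library's implementation: packed_index and tpmv_reference are hypothetical names, indices are 0-based, only the trans = 'N' case is covered, incx is taken as 1, and the result is returned as a new list instead of overwriting x.

```python
def packed_index(i, j, n, uplo="U"):
    """0-based position in ap of element (i, j) of an n-by-n triangular
    matrix packed column-by-column (i <= j for 'U', i >= j for 'L')."""
    if uplo == "U":
        return i + j * (j + 1) // 2
    return i + j * (2 * n - j - 1) // 2


def tpmv_reference(ap, x, n, uplo="U", diag="N"):
    """Return A*x for a triangular matrix A held in packed storage."""
    y = [0.0] * n
    for i in range(n):
        # Only the stored triangle contributes to row i.
        cols = range(i, n) if uplo == "U" else range(0, i + 1)
        for j in cols:
            if i == j and diag == "U":
                a_ij = 1.0  # unit diagonal: ap is not referenced here
            else:
                a_ij = ap[packed_index(i, j, n, uplo)]
            y[i] += a_ij * x[j]
    return y


# Upper triangle [[1, 2, 3], [0, 4, 5], [0, 0, 6]] packed by columns.
ap_upper = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]
# Lower triangle [[1, 0, 0], [2, 3, 0], [4, 5, 6]] packed by columns.
ap_lower = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]
ones = [1.0, 1.0, 1.0]
upper_product = tpmv_reference(ap_upper, ones, 3, uplo="U")
lower_product = tpmv_reference(ap_lower, ones, 3, uplo="L")
```

Note that the same packed array describes two different matrices depending on uplo, which is why the argument cannot be inferred from ap alone.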
Syntax

Fortran 77:
call stpsv(uplo, trans, diag, n, ap, x, incx)
call dtpsv(uplo, trans, diag, n, ap, x, incx)
call ctpsv(uplo, trans, diag, n, ap, x, incx)
call ztpsv(uplo, trans, diag, n, ap, x, incx)

Fortran 95:
call tpsv(ap, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?tpsv routines solve one of the following systems of equations:
A*x = b, or A'*x = b, or conjg(A')*x = b,
where:
b and x are n-element vectors,
A is an n-by-n unit, or non-unit, upper or lower triangular matrix, supplied in packed form.
This routine does not test for singularity or near-singularity. Such tests must be performed before calling this routine.

Input Parameters

uplo  CHARACTER*1. Specifies whether the matrix A is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans  CHARACTER*1. Specifies the system of equations: if trans = 'N' or 'n', then A*x = b; if trans = 'T' or 't', then A'*x = b; if trans = 'C' or 'c', then conjg(A')*x = b.
diag  CHARACTER*1. Specifies whether the matrix A is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
n  INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
ap  REAL for stpsv; DOUBLE PRECISION for dtpsv; COMPLEX for ctpsv; DOUBLE COMPLEX for ztpsv. Array, DIMENSION at least ((n*(n + 1))/2).
Before entry with uplo = 'U' or 'u', the array ap must contain the upper triangular matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(1, 2) and a(2, 2) respectively, and so on.
Before entry with uplo = 'L' or 'l', the array ap must contain the lower triangular matrix packed sequentially, column-by-column, so that ap(1) contains a(1, 1), ap(2) and ap(3) contain a(2, 1) and a(3, 1) respectively, and so on.
When diag = 'U' or 'u', the diagonal elements of a are not referenced, but are assumed to be unity.
x  REAL for stpsv; DOUBLE PRECISION for dtpsv; COMPLEX for ctpsv; DOUBLE COMPLEX for ztpsv. Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element right-hand side vector b.
incx  INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters

x  Overwritten with the solution vector x.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tpsv interface are the following:
ap  Holds the array ap of size (n*(n+1)/2).
x  Holds the vector with the number of elements n.
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.
diag  Must be 'N' or 'U'. The default value is 'N'.

?trmv
Computes a matrix-vector product using a triangular matrix.
Syntax

Fortran 77:
call strmv(uplo, trans, diag, n, a, lda, x, incx)
call dtrmv(uplo, trans, diag, n, a, lda, x, incx)
call ctrmv(uplo, trans, diag, n, a, lda, x, incx)
call ztrmv(uplo, trans, diag, n, a, lda, x, incx)

Fortran 95:
call trmv(a, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?trmv routines perform one of the following matrix-vector operations:
x := A*x, or x := A'*x, or x := conjg(A')*x,
where:
x is an n-element vector,
A is an n-by-n unit, or non-unit, upper or lower triangular matrix.

Input Parameters

uplo  CHARACTER*1. Specifies whether the matrix A is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans  CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then x := A*x; if trans = 'T' or 't', then x := A'*x; if trans = 'C' or 'c', then x := conjg(A')*x.
diag  CHARACTER*1. Specifies whether the matrix A is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
n  INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
a  REAL for strmv; DOUBLE PRECISION for dtrmv; COMPLEX for ctrmv; DOUBLE COMPLEX for ztrmv. Array, DIMENSION (lda,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular matrix and the strictly lower triangular part of a is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular matrix and the strictly upper triangular part of a is not referenced.
When diag = 'U' or 'u', the diagonal elements of a are not referenced either, but are assumed to be unity.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program.
The value of lda must be at least max(1, n).
x  REAL for strmv; DOUBLE PRECISION for dtrmv; COMPLEX for ctrmv; DOUBLE COMPLEX for ztrmv. Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element vector x.
incx  INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters

x  Overwritten with the transformed vector x.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trmv interface are the following:
a  Holds the matrix A of size (n,n).
x  Holds the vector with the number of elements n.
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.
diag  Must be 'N' or 'U'. The default value is 'N'.

?trsv
Solves a system of linear equations whose coefficients are in a triangular matrix.

Syntax

Fortran 77:
call strsv(uplo, trans, diag, n, a, lda, x, incx)
call dtrsv(uplo, trans, diag, n, a, lda, x, incx)
call ctrsv(uplo, trans, diag, n, a, lda, x, incx)
call ztrsv(uplo, trans, diag, n, a, lda, x, incx)

Fortran 95:
call trsv(a, x [,uplo] [, trans] [,diag])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?trsv routines solve one of the systems of equations:
A*x = b, or A'*x = b, or conjg(A')*x = b,
where:
b and x are n-element vectors,
A is an n-by-n unit, or non-unit, upper or lower triangular matrix.
The routine does not test for singularity or near-singularity. Such tests must be performed before calling this routine.

Input Parameters

uplo  CHARACTER*1.
Specifies whether the matrix A is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans  CHARACTER*1. Specifies the system of equations: if trans = 'N' or 'n', then A*x = b; if trans = 'T' or 't', then A'*x = b; if trans = 'C' or 'c', then conjg(A')*x = b.
diag  CHARACTER*1. Specifies whether the matrix A is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
n  INTEGER. Specifies the order of the matrix A. The value of n must be at least zero.
a  REAL for strsv; DOUBLE PRECISION for dtrsv; COMPLEX for ctrsv; DOUBLE COMPLEX for ztrsv. Array, DIMENSION (lda,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular matrix and the strictly lower triangular part of a is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular matrix and the strictly upper triangular part of a is not referenced.
When diag = 'U' or 'u', the diagonal elements of a are not referenced either, but are assumed to be unity.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, n).
x  REAL for strsv; DOUBLE PRECISION for dtrsv; COMPLEX for ctrsv; DOUBLE COMPLEX for ztrsv. Array, DIMENSION at least (1 + (n - 1)*abs(incx)). Before entry, the incremented array x must contain the n-element right-hand side vector b.
incx  INTEGER. Specifies the increment for the elements of x. The value of incx must not be zero.

Output Parameters

x  Overwritten with the solution vector x.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trsv interface are the following:
a  Holds the matrix A of size (n,n).
x  Holds the vector with the number of elements n.
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.
diag  Must be 'N' or 'U'. The default value is 'N'.

BLAS Level 3 Routines

BLAS Level 3 routines perform matrix-matrix operations. Table “BLAS Level 3 Routine Groups and Their Data Types” lists the BLAS Level 3 routine groups and the data types associated with them.

BLAS Level 3 Routine Groups and Their Data Types

Routine Group   Data Types    Description
?gemm           s, d, c, z    Matrix-matrix product of general matrices
?hemm           c, z          Matrix-matrix product of Hermitian matrices
?herk           c, z          Rank-k update of Hermitian matrices
?her2k          c, z          Rank-2k update of Hermitian matrices
?symm           s, d, c, z    Matrix-matrix product of symmetric matrices
?syrk           s, d, c, z    Rank-k update of symmetric matrices
?syr2k          s, d, c, z    Rank-2k update of symmetric matrices
?trmm           s, d, c, z    Matrix-matrix product of triangular matrices
?trsm           s, d, c, z    Linear matrix-matrix solution for triangular matrices

Symmetric Multiprocessing Version of Intel® MKL

Many applications spend considerable time executing BLAS routines. This time can be scaled by the number of processors available on the system by using the symmetric multiprocessing (SMP) feature built into Intel MKL. The performance enhancements based on the parallel use of the processors are available without any programming effort on your part.
To enhance performance, the library uses the following methods:
• The BLAS functions are blocked where possible to restructure the code in a way that increases the localization of data reference, enhances cache memory use, and reduces the dependency on the memory bus.
• The code is distributed across the processors to maximize parallelism.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

?gemm
Computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product.

Syntax

Fortran 77:
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call cgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call zgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call scgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call dzgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)

Fortran 95:
call gemm(a, b, c [,transa][,transb] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?gemm routines perform a matrix-matrix operation with general matrices.
The operation is defined as
C := alpha*op(A)*op(B) + beta*C,
where:
op(x) is one of op(x) = x, or op(x) = x', or op(x) = conjg(x'),
alpha and beta are scalars,
A, B and C are matrices:
op(A) is an m-by-k matrix,
op(B) is a k-by-n matrix,
C is an m-by-n matrix.
See also ?gemm3m, BLAS-like extension routines, that use matrix multiplication for similar matrix-matrix operations.

Input Parameters

transa  CHARACTER*1. Specifies the form of op(A) used in the matrix multiplication: if transa = 'N' or 'n', then op(A) = A; if transa = 'T' or 't', then op(A) = A'; if transa = 'C' or 'c', then op(A) = conjg(A').
transb  CHARACTER*1. Specifies the form of op(B) used in the matrix multiplication: if transb = 'N' or 'n', then op(B) = B; if transb = 'T' or 't', then op(B) = B'; if transb = 'C' or 'c', then op(B) = conjg(B').
m  INTEGER. Specifies the number of rows of the matrix op(A) and of the matrix C. The value of m must be at least zero.
n  INTEGER. Specifies the number of columns of the matrix op(B) and the number of columns of the matrix C. The value of n must be at least zero.
k  INTEGER. Specifies the number of columns of the matrix op(A) and the number of rows of the matrix op(B). The value of k must be at least zero.
alpha  REAL for sgemm; DOUBLE PRECISION for dgemm; COMPLEX for cgemm, scgemm; DOUBLE COMPLEX for zgemm, dzgemm. Specifies the scalar alpha.
a  REAL for sgemm, scgemm; DOUBLE PRECISION for dgemm, dzgemm; COMPLEX for cgemm; DOUBLE COMPLEX for zgemm. Array, DIMENSION (lda, ka), where ka is k when transa = 'N' or 'n', and is m otherwise. Before entry with transa = 'N' or 'n', the leading m-by-k part of the array a must contain the matrix A, otherwise the leading k-by-m part of the array a must contain the matrix A.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When transa = 'N' or 'n', then lda must be at least max(1, m), otherwise lda must be at least max(1, k).
b  REAL for sgemm; DOUBLE PRECISION for dgemm; COMPLEX for cgemm, scgemm; DOUBLE COMPLEX for zgemm, dzgemm. Array, DIMENSION (ldb, kb), where kb is n when transb = 'N' or 'n', and is k otherwise. Before entry with transb = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading n-by-k part of the array b must contain the matrix B.
ldb  INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. When transb = 'N' or 'n', then ldb must be at least max(1, k), otherwise ldb must be at least max(1, n).
beta  REAL for sgemm; DOUBLE PRECISION for dgemm; COMPLEX for cgemm, scgemm; DOUBLE COMPLEX for zgemm, dzgemm. Specifies the scalar beta. When beta is equal to zero, then c need not be set on input.
c  REAL for sgemm; DOUBLE PRECISION for dgemm; COMPLEX for cgemm, scgemm; DOUBLE COMPLEX for zgemm, dzgemm. Array, DIMENSION (ldc, n). Before entry, the leading m-by-n part of the array c must contain the matrix C, except when beta is equal to zero, in which case c need not be set on entry.
ldc  INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, m).

Output Parameters

c  Overwritten by the m-by-n matrix (alpha*op(A)*op(B) + beta*C).

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gemm interface are the following:
a  Holds the matrix A of size (ma,ka) where ka = k if transa = 'N', ka = m otherwise, ma = m if transa = 'N', ma = k otherwise.
b  Holds the matrix B of size (mb,kb) where kb = n if transb = 'N', kb = k otherwise, mb = k if transb = 'N', mb = n otherwise.
c  Holds the matrix C of size (m,n).
transa  Must be 'N', 'C', or 'T'. The default value is 'N'.
transb  Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha  The default value is 1.
beta  The default value is 0.

?hemm
Computes a scalar-matrix-matrix product (either one of the matrices is Hermitian) and adds the result to a scalar-matrix product.

Syntax

Fortran 77:
call chemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
call zhemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)

Fortran 95:
call hemm(a, b, c [,side][,uplo] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?hemm routines perform a matrix-matrix operation using Hermitian matrices. The operation is defined as
C := alpha*A*B + beta*C
or
C := alpha*B*A + beta*C,
where:
alpha and beta are scalars,
A is a Hermitian matrix,
B and C are m-by-n matrices.

Input Parameters

side  CHARACTER*1. Specifies whether the Hermitian matrix A appears on the left or right in the operation as follows: if side = 'L' or 'l', then C := alpha*A*B + beta*C; if side = 'R' or 'r', then C := alpha*B*A + beta*C.
uplo  CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian matrix A is used: if uplo = 'U' or 'u', then the upper triangular part of the Hermitian matrix A is used; if uplo = 'L' or 'l', then the lower triangular part of the Hermitian matrix A is used.
m  INTEGER. Specifies the number of rows of the matrix C. The value of m must be at least zero.
n  INTEGER. Specifies the number of columns of the matrix C. The value of n must be at least zero.
alpha  COMPLEX for chemm; DOUBLE COMPLEX for zhemm. Specifies the scalar alpha.
a  COMPLEX for chemm; DOUBLE COMPLEX for zhemm. Array, DIMENSION (lda,ka), where ka is m when side = 'L' or 'l' and is n otherwise.
Before entry with side = 'L' or 'l', the m-by-m part of the array a must contain the Hermitian matrix, such that when uplo = 'U' or 'u', the leading m-by-m upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced, and when uplo = 'L' or 'l', the leading m-by-m lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix, and the strictly upper triangular part of a is not referenced.
Before entry with side = 'R' or 'r', the n-by-n part of the array a must contain the Hermitian matrix, such that when uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of a is not referenced, and when uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the Hermitian matrix, and the strictly upper triangular part of a is not referenced.
The imaginary parts of the diagonal elements need not be set; they are assumed to be zero.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When side = 'L' or 'l', then lda must be at least max(1, m), otherwise lda must be at least max(1, n).
b  COMPLEX for chemm; DOUBLE COMPLEX for zhemm. Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the matrix B.
ldb  INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m).
beta  COMPLEX for chemm; DOUBLE COMPLEX for zhemm. Specifies the scalar beta. When beta is supplied as zero, then c need not be set on input.
c  COMPLEX for chemm; DOUBLE COMPLEX for zhemm. Array, DIMENSION (ldc, n).
Before entry, the leading m-by-n part of the array c must contain the matrix C, except when beta is zero, in which case c need not be set on entry.
ldc  INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, m).

Output Parameters

c  Overwritten by the m-by-n updated matrix.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hemm interface are the following:
a  Holds the matrix A of size (k,k) where k = m if side = 'L', k = n otherwise.
b  Holds the matrix B of size (m,n).
c  Holds the matrix C of size (m,n).
side  Must be 'L' or 'R'. The default value is 'L'.
uplo  Must be 'U' or 'L'. The default value is 'U'.
alpha  The default value is 1.
beta  The default value is 0.

?herk
Performs a rank-k update of a Hermitian matrix.

Syntax

Fortran 77:
call cherk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
call zherk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)

Fortran 95:
call herk(a, c [,uplo] [, trans] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?herk routines perform a matrix-matrix operation using Hermitian matrices. The operation is defined as
C := alpha*A*conjg(A') + beta*C,
or
C := alpha*conjg(A')*A + beta*C,
where:
alpha and beta are real scalars,
C is an n-by-n Hermitian matrix,
A is an n-by-k matrix in the first case and a k-by-n matrix in the second case.

Input Parameters

uplo  CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used. If uplo = 'U' or 'u', then the upper triangular part of the array c is used. If uplo = 'L' or 'l', then the lower triangular part of the array c is used.
trans  CHARACTER*1.
Specifies the operation: if trans = 'N' or 'n', then C := alpha*A*conjg(A') + beta*C; if trans = 'C' or 'c', then C := alpha*conjg(A')*A + beta*C.
n  INTEGER. Specifies the order of the matrix C. The value of n must be at least zero.
k  INTEGER. With trans = 'N' or 'n', k specifies the number of columns of the matrix A, and with trans = 'C' or 'c', k specifies the number of rows of the matrix A. The value of k must be at least zero.
alpha  REAL for cherk; DOUBLE PRECISION for zherk. Specifies the scalar alpha.
a  COMPLEX for cherk; DOUBLE COMPLEX for zherk. Array, DIMENSION (lda, ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A, otherwise the leading k-by-n part of the array a must contain the matrix A.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', then lda must be at least max(1, n), otherwise lda must be at least max(1, k).
beta  REAL for cherk; DOUBLE PRECISION for zherk. Specifies the scalar beta.
c  COMPLEX for cherk; DOUBLE COMPLEX for zherk. Array, DIMENSION (ldc,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array c must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of c is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array c must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of c is not referenced.
The imaginary parts of the diagonal elements need not be set; they are assumed to be zero.
ldc  INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, n).
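To make the roles of uplo and of the diagonal convention concrete, the following pure-Python sketch mimics the trans = 'N', uplo = 'U' case, C := alpha*A*conjg(A') + beta*C. It is a reference illustration only, not the library's algorithm: herk_upper_reference is a hypothetical name, matrices are plain nested lists with 0-based indices, and A is assumed to be n-by-k with k at least 1. Only the upper triangle of c is read and written, and the imaginary part of each diagonal entry is discarded.

```python
def herk_upper_reference(alpha, a, beta, c):
    """Update the upper triangle of c in place with
    alpha * A * conjg(A') + beta * C, for real alpha and beta."""
    n = len(a)
    k = len(a[0]) if n else 0
    for j in range(n):
        for i in range(j + 1):  # upper triangle only: i <= j
            acc = sum(a[i][p] * a[j][p].conjugate() for p in range(k))
            c[i][j] = alpha * acc + beta * c[i][j]
        # A Hermitian matrix has a real diagonal, so drop any imaginary part.
        c[j][j] = complex(c[j][j].real, 0.0)
    return c


a_mat = [[1 + 1j, 2],
         [0, 1j]]
# The imaginary part of c[0][0] and the strictly lower entry 99 are
# deliberately "wrong" to show that they do not influence the result.
c_mat = [[1 + 5j, 1j],
         [99, 2]]
herk_upper_reference(2.0, a_mat, 1.0, c_mat)
```

After the call, c_mat[1][0] still holds 99, which mirrors the statement that the strictly lower triangular part of c is not referenced when uplo = 'U'.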
Output Parameters

c  With uplo = 'U' or 'u', the upper triangular part of the array c is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array c is overwritten by the lower triangular part of the updated matrix. The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine herk interface are the following:
a  Holds the matrix A of size (ma,ka) where ka = k if trans = 'N', ka = n otherwise, ma = n if trans = 'N', ma = k otherwise.
c  Holds the matrix C of size (n,n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N' or 'C'. The default value is 'N'.
alpha  The default value is 1.
beta  The default value is 0.

?her2k
Performs a rank-2k update of a Hermitian matrix.

Syntax

Fortran 77:
call cher2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call zher2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)

Fortran 95:
call her2k(a, b, c [,uplo][,trans] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?her2k routines perform a rank-2k matrix-matrix operation using Hermitian matrices. The operation is defined as
C := alpha*A*conjg(B') + conjg(alpha)*B*conjg(A') + beta*C,
or
C := alpha*conjg(A')*B + conjg(alpha)*conjg(B')*A + beta*C,
where:
alpha is a scalar and beta is a real scalar,
C is an n-by-n Hermitian matrix,
A and B are n-by-k matrices in the first case and k-by-n matrices in the second case.

Input Parameters

uplo  CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used.
If uplo = 'U' or 'u', then the upper triangular part of the array c is used. If uplo = 'L' or 'l', then the lower triangular part of the array c is used.
trans  CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then C := alpha*A*conjg(B') + conjg(alpha)*B*conjg(A') + beta*C; if trans = 'C' or 'c', then C := alpha*conjg(A')*B + conjg(alpha)*conjg(B')*A + beta*C.
n  INTEGER. Specifies the order of the matrix C. The value of n must be at least zero.
k  INTEGER. With trans = 'N' or 'n', k specifies the number of columns of the matrix A, and with trans = 'C' or 'c', k specifies the number of rows of the matrix A. The value of k must be at least zero.
alpha  COMPLEX for cher2k; DOUBLE COMPLEX for zher2k. Specifies the scalar alpha.
a  COMPLEX for cher2k; DOUBLE COMPLEX for zher2k. Array, DIMENSION (lda, ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A, otherwise the leading k-by-n part of the array a must contain the matrix A.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', then lda must be at least max(1, n), otherwise lda must be at least max(1, k).
beta  REAL for cher2k; DOUBLE PRECISION for zher2k. Specifies the scalar beta.
b  COMPLEX for cher2k; DOUBLE COMPLEX for zher2k. Array, DIMENSION (ldb, kb), where kb is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array b must contain the matrix B, otherwise the leading k-by-n part of the array b must contain the matrix B.
ldb  INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. When trans = 'N' or 'n', then ldb must be at least max(1, n), otherwise ldb must be at least max(1, k).
c  COMPLEX for cher2k; DOUBLE COMPLEX for zher2k. Array, DIMENSION (ldc,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array c must contain the upper triangular part of the Hermitian matrix and the strictly lower triangular part of c is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array c must contain the lower triangular part of the Hermitian matrix and the strictly upper triangular part of c is not referenced.
The imaginary parts of the diagonal elements need not be set; they are assumed to be zero.
ldc  INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, n).

Output Parameters

c  With uplo = 'U' or 'u', the upper triangular part of the array c is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array c is overwritten by the lower triangular part of the updated matrix. The imaginary parts of the diagonal elements are set to zero.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine her2k interface are the following:
a  Holds the matrix A of size (ma,ka) where ka = k if trans = 'N', ka = n otherwise, ma = n if trans = 'N', ma = k otherwise.
b  Holds the matrix B of size (mb,kb) where kb = k if trans = 'N', kb = n otherwise, mb = n if trans = 'N', mb = k otherwise.
c  Holds the matrix C of size (n,n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N' or 'C'. The default value is 'N'.
alpha  The default value is 1.
beta  The default value is 0.

?symm
Performs a scalar-matrix-matrix product (one matrix operand is symmetric) and adds the result to a scalar-matrix product.
Syntax
Fortran 77:
call ssymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
call dsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
call csymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
call zsymm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call symm(a, b, c [,side][,uplo] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?symm routines perform a matrix-matrix operation using symmetric matrices. The operation is defined as
C := alpha*A*B + beta*C,
or
C := alpha*B*A + beta*C,
where:
alpha and beta are scalars,
A is a symmetric matrix,
B and C are m-by-n matrices.

Input Parameters
side  CHARACTER*1. Specifies whether the symmetric matrix A appears on the left or right in the operation:
if side = 'L' or 'l', then C := alpha*A*B + beta*C;
if side = 'R' or 'r', then C := alpha*B*A + beta*C.
uplo  CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric matrix A is used:
if uplo = 'U' or 'u', then the upper triangular part is used;
if uplo = 'L' or 'l', then the lower triangular part is used.
m  INTEGER. Specifies the number of rows of the matrix C. The value of m must be at least zero.
n  INTEGER. Specifies the number of columns of the matrix C. The value of n must be at least zero.
alpha  REAL for ssymm; DOUBLE PRECISION for dsymm; COMPLEX for csymm; DOUBLE COMPLEX for zsymm. Specifies the scalar alpha.
a  REAL for ssymm; DOUBLE PRECISION for dsymm; COMPLEX for csymm; DOUBLE COMPLEX for zsymm. Array, DIMENSION (lda, ka), where ka is m when side = 'L' or 'l' and is n otherwise.
Before entry with side = 'L' or 'l', the m-by-m part of the array a must contain the symmetric matrix, such that when uplo = 'U' or 'u', the leading m-by-m upper triangular part of the array a must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of a is not referenced, and when uplo = 'L' or 'l', the leading m-by-m lower triangular part of the array a must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of a is not referenced.
Before entry with side = 'R' or 'r', the n-by-n part of the array a must contain the symmetric matrix, such that when uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array a must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of a is not referenced, and when uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array a must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of a is not referenced.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When side = 'L' or 'l', then lda must be at least max(1, m), otherwise lda must be at least max(1, n).
b  REAL for ssymm; DOUBLE PRECISION for dsymm; COMPLEX for csymm; DOUBLE COMPLEX for zsymm. Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the matrix B.
ldb  INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m).
beta  REAL for ssymm; DOUBLE PRECISION for dsymm; COMPLEX for csymm; DOUBLE COMPLEX for zsymm. Specifies the scalar beta. When beta is set to zero, then c need not be set on input.
c  REAL for ssymm; DOUBLE PRECISION for dsymm; COMPLEX for csymm; DOUBLE COMPLEX for zsymm. Array, DIMENSION (ldc,n).
Before entry, the leading m-by-n part of the array c must contain the matrix C, except when beta is zero, in which case c need not be set on entry.
ldc  INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, m).

Output Parameters
c  Overwritten by the m-by-n updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine symm interface are the following:
a  Holds the matrix A of size (k,k) where k = m if side = 'L', k = n otherwise.
b  Holds the matrix B of size (m,n).
c  Holds the matrix C of size (m,n).
side  Must be 'L' or 'R'. The default value is 'L'.
uplo  Must be 'U' or 'L'. The default value is 'U'.
alpha  The default value is 1.
beta  The default value is 0.

?syrk
Performs a rank-k update of a symmetric matrix.

Syntax
Fortran 77:
call ssyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
call dsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
call csyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
call zsyrk(uplo, trans, n, k, alpha, a, lda, beta, c, ldc)
Fortran 95:
call syrk(a, c [,uplo] [, trans] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?syrk routines perform a matrix-matrix operation using symmetric matrices. The operation is defined as
C := alpha*A*A' + beta*C,
or
C := alpha*A'*A + beta*C,
where:
alpha and beta are scalars,
C is an n-by-n symmetric matrix,
A is an n-by-k matrix in the first case and a k-by-n matrix in the second case.

Input Parameters
uplo  CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used.
If uplo = 'U' or 'u', then the upper triangular part of the array c is used.
If uplo = 'L' or 'l', then the lower triangular part of the array c is used.
trans  CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then C := alpha*A*A' + beta*C;
if trans = 'T' or 't', then C := alpha*A'*A + beta*C;
if trans = 'C' or 'c', then C := alpha*A'*A + beta*C.
n  INTEGER. Specifies the order of the matrix C. The value of n must be at least zero.
k  INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the matrix A, and on entry with trans = 'T' or 't' or 'C' or 'c', k specifies the number of rows of the matrix A. The value of k must be at least zero.
alpha  REAL for ssyrk; DOUBLE PRECISION for dsyrk; COMPLEX for csyrk; DOUBLE COMPLEX for zsyrk. Specifies the scalar alpha.
a  REAL for ssyrk; DOUBLE PRECISION for dsyrk; COMPLEX for csyrk; DOUBLE COMPLEX for zsyrk. Array, DIMENSION (lda,ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A, otherwise the leading k-by-n part of the array a must contain the matrix A.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', then lda must be at least max(1, n), otherwise lda must be at least max(1, k).
beta  REAL for ssyrk; DOUBLE PRECISION for dsyrk; COMPLEX for csyrk; DOUBLE COMPLEX for zsyrk. Specifies the scalar beta.
c  REAL for ssyrk; DOUBLE PRECISION for dsyrk; COMPLEX for csyrk; DOUBLE COMPLEX for zsyrk. Array, DIMENSION (ldc,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array c must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of c is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array c must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of c is not referenced.
ldc  INTEGER.
Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, n).

Output Parameters
c  With uplo = 'U' or 'u', the upper triangular part of the array c is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array c is overwritten by the lower triangular part of the updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syrk interface are the following:
a  Holds the matrix A of size (ma,ka) where ka = k if trans = 'N', ka = n otherwise, ma = n if trans = 'N', ma = k otherwise.
c  Holds the matrix C of size (n,n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha  The default value is 1.
beta  The default value is 0.

?syr2k
Performs a rank-2k update of a symmetric matrix.

Syntax
Fortran 77:
call ssyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call dsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call csyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call zsyr2k(uplo, trans, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call syr2k(a, b, c [,uplo][,trans] [,alpha][,beta])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?syr2k routines perform a rank-2k matrix-matrix operation using symmetric matrices. The operation is defined as
C := alpha*A*B' + alpha*B*A' + beta*C,
or
C := alpha*A'*B + alpha*B'*A + beta*C,
where:
alpha and beta are scalars,
C is an n-by-n symmetric matrix,
A and B are n-by-k matrices in the first case, and k-by-n matrices in the second case.
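The rank-2k update defined above can be illustrated with a small reference sketch. This is plain Python on nested lists with 0-based indices, not a call into Intel MKL; the function name syr2k_ref and its argument order are assumptions made for illustration only. It shows how trans selects between the two forms of the operation and how uplo restricts the update to one triangle of C.

```python
def syr2k_ref(uplo, trans, alpha, a, b, beta, c):
    """Reference sketch of the ?syr2k operation (hypothetical helper, not MKL).

    trans = 'N': C := alpha*A*B' + alpha*B*A' + beta*C, with A and B n-by-k.
    otherwise  : C := alpha*A'*B + alpha*B'*A + beta*C, with A and B k-by-n.
    Only the triangle of c selected by uplo is read and overwritten.
    """
    n = len(c)
    no_trans = trans.upper() == 'N'
    if no_trans:
        k = len(a[0]) if a else 0
    else:
        k = len(a)

    def elem(m, i, p):
        # Element (i, p) of op(M): M itself for 'N', its transpose otherwise.
        return m[i][p] if no_trans else m[p][i]

    upper = uplo.upper() == 'U'
    for i in range(n):
        cols = range(i, n) if upper else range(0, i + 1)
        for j in cols:
            s = sum(elem(a, i, p) * elem(b, j, p) + elem(b, i, p) * elem(a, j, p)
                    for p in range(k))
            c[i][j] = alpha * s + beta * c[i][j]
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = [[1, 1], [-9, 1]]
    # The strictly lower entry (-9) is not referenced when uplo = 'U'.
    print(syr2k_ref('U', 'N', 1, a, b, 0, c))
```

The entry outside the selected triangle is left untouched, which mirrors the statement in the parameter descriptions that the opposite strict triangle of c is not referenced.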

Input Parameters
uplo  CHARACTER*1. Specifies whether the upper or lower triangular part of the array c is used.
If uplo = 'U' or 'u', then the upper triangular part of the array c is used.
If uplo = 'L' or 'l', then the lower triangular part of the array c is used.
trans  CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then C := alpha*A*B' + alpha*B*A' + beta*C;
if trans = 'T' or 't', then C := alpha*A'*B + alpha*B'*A + beta*C;
if trans = 'C' or 'c', then C := alpha*A'*B + alpha*B'*A + beta*C.
n  INTEGER. Specifies the order of the matrix C. The value of n must be at least zero.
k  INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the matrices A and B, and on entry with trans = 'T' or 't' or 'C' or 'c', k specifies the number of rows of the matrices A and B. The value of k must be at least zero.
alpha  REAL for ssyr2k; DOUBLE PRECISION for dsyr2k; COMPLEX for csyr2k; DOUBLE COMPLEX for zsyr2k. Specifies the scalar alpha.
a  REAL for ssyr2k; DOUBLE PRECISION for dsyr2k; COMPLEX for csyr2k; DOUBLE COMPLEX for zsyr2k. Array, DIMENSION (lda,ka), where ka is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array a must contain the matrix A, otherwise the leading k-by-n part of the array a must contain the matrix A.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When trans = 'N' or 'n', then lda must be at least max(1, n), otherwise lda must be at least max(1, k).
b  REAL for ssyr2k; DOUBLE PRECISION for dsyr2k; COMPLEX for csyr2k; DOUBLE COMPLEX for zsyr2k. Array, DIMENSION (ldb, kb), where kb is k when trans = 'N' or 'n', and is n otherwise. Before entry with trans = 'N' or 'n', the leading n-by-k part of the array b must contain the matrix B, otherwise the leading k-by-n part of the array b must contain the matrix B.
ldb  INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.
When trans = 'N' or 'n', then ldb must be at least max(1, n), otherwise ldb must be at least max(1, k).
beta  REAL for ssyr2k; DOUBLE PRECISION for dsyr2k; COMPLEX for csyr2k; DOUBLE COMPLEX for zsyr2k. Specifies the scalar beta.
c  REAL for ssyr2k; DOUBLE PRECISION for dsyr2k; COMPLEX for csyr2k; DOUBLE COMPLEX for zsyr2k. Array, DIMENSION (ldc,n).
Before entry with uplo = 'U' or 'u', the leading n-by-n upper triangular part of the array c must contain the upper triangular part of the symmetric matrix and the strictly lower triangular part of c is not referenced.
Before entry with uplo = 'L' or 'l', the leading n-by-n lower triangular part of the array c must contain the lower triangular part of the symmetric matrix and the strictly upper triangular part of c is not referenced.
ldc  INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, n).

Output Parameters
c  With uplo = 'U' or 'u', the upper triangular part of the array c is overwritten by the upper triangular part of the updated matrix. With uplo = 'L' or 'l', the lower triangular part of the array c is overwritten by the lower triangular part of the updated matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syr2k interface are the following:
a  Holds the matrix A of size (ma,ka) where ka = k if trans = 'N', ka = n otherwise, ma = n if trans = 'N', ma = k otherwise.
b  Holds the matrix B of size (mb,kb) where kb = k if trans = 'N', kb = n otherwise, mb = n if trans = 'N', mb = k otherwise.
c  Holds the matrix C of size (n,n).
uplo  Must be 'U' or 'L'. The default value is 'U'.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha  The default value is 1.
beta  The default value is 0.

?trmm
Computes a scalar-matrix-matrix product (one matrix operand is triangular).

Syntax
Fortran 77:
call strmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call dtrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call ctrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call ztrmm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
Fortran 95:
call trmm(a, b [,side] [, uplo] [,transa][,diag] [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?trmm routines perform a matrix-matrix operation using triangular matrices. The operation is defined as
B := alpha*op(A)*B
or
B := alpha*B*op(A)
where:
alpha is a scalar,
B is an m-by-n matrix,
A is a unit, or non-unit, upper or lower triangular matrix,
op(A) is one of op(A) = A, or op(A) = A', or op(A) = conjg(A').

Input Parameters
side  CHARACTER*1. Specifies whether op(A) appears on the left or right of B in the operation:
if side = 'L' or 'l', then B := alpha*op(A)*B;
if side = 'R' or 'r', then B := alpha*B*op(A).
uplo  CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
transa  CHARACTER*1. Specifies the form of op(A) used in the matrix multiplication:
if transa = 'N' or 'n', then op(A) = A;
if transa = 'T' or 't', then op(A) = A';
if transa = 'C' or 'c', then op(A) = conjg(A').
diag  CHARACTER*1. Specifies whether the matrix A is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
m  INTEGER. Specifies the number of rows of B. The value of m must be at least zero.
n  INTEGER. Specifies the number of columns of B. The value of n must be at least zero.
alpha  REAL for strmm; DOUBLE PRECISION for dtrmm; COMPLEX for ctrmm; DOUBLE COMPLEX for ztrmm. Specifies the scalar alpha. When alpha is zero, then a is not referenced and b need not be set before entry.
a  REAL for strmm; DOUBLE PRECISION for dtrmm; COMPLEX for ctrmm; DOUBLE COMPLEX for ztrmm. Array, DIMENSION (lda,k), where k is m when side = 'L' or 'l' and is n when side = 'R' or 'r'.
Before entry with uplo = 'U' or 'u', the leading k-by-k upper triangular part of the array a must contain the upper triangular matrix and the strictly lower triangular part of a is not referenced.
Before entry with uplo = 'L' or 'l', the leading k-by-k lower triangular part of the array a must contain the lower triangular matrix and the strictly upper triangular part of a is not referenced.
When diag = 'U' or 'u', the diagonal elements of a are not referenced either, but are assumed to be unity.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When side = 'L' or 'l', then lda must be at least max(1, m); when side = 'R' or 'r', then lda must be at least max(1, n).
b  REAL for strmm; DOUBLE PRECISION for dtrmm; COMPLEX for ctrmm; DOUBLE COMPLEX for ztrmm. Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the matrix B.
ldb  INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m).

Output Parameters
b  Overwritten by the transformed matrix.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trmm interface are the following:
a  Holds the matrix A of size (k,k) where k = m if side = 'L', k = n otherwise.
b  Holds the matrix B of size (m,n).
side  Must be 'L' or 'R'.
The default value is 'L'.
uplo  Must be 'U' or 'L'. The default value is 'U'.
transa  Must be 'N', 'C', or 'T'. The default value is 'N'.
diag  Must be 'N' or 'U'. The default value is 'N'.
alpha  The default value is 1.

?trsm
Solves a matrix equation (one matrix operand is triangular).

Syntax
Fortran 77:
call strsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call dtrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call ctrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
call ztrsm(side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb)
Fortran 95:
call trsm(a, b [,side] [, uplo] [,transa][,diag] [,alpha])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?trsm routines solve one of the following matrix equations:
op(A)*X = alpha*B,
or
X*op(A) = alpha*B,
where:
alpha is a scalar,
X and B are m-by-n matrices,
A is a unit, or non-unit, upper or lower triangular matrix,
op(A) is one of op(A) = A, or op(A) = A', or op(A) = conjg(A').
The matrix B is overwritten by the solution matrix X.

Input Parameters
side  CHARACTER*1. Specifies whether op(A) appears on the left or right of X in the equation:
if side = 'L' or 'l', then op(A)*X = alpha*B;
if side = 'R' or 'r', then X*op(A) = alpha*B.
uplo  CHARACTER*1. Specifies whether the matrix A is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
transa  CHARACTER*1. Specifies the form of op(A) used in the matrix multiplication:
if transa = 'N' or 'n', then op(A) = A;
if transa = 'T' or 't', then op(A) = A';
if transa = 'C' or 'c', then op(A) = conjg(A').
diag  CHARACTER*1. Specifies whether the matrix A is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
m  INTEGER. Specifies the number of rows of B.
The value of m must be at least zero.
n  INTEGER. Specifies the number of columns of B. The value of n must be at least zero.
alpha  REAL for strsm; DOUBLE PRECISION for dtrsm; COMPLEX for ctrsm; DOUBLE COMPLEX for ztrsm. Specifies the scalar alpha. When alpha is zero, then a is not referenced and b need not be set before entry.
a  REAL for strsm; DOUBLE PRECISION for dtrsm; COMPLEX for ctrsm; DOUBLE COMPLEX for ztrsm. Array, DIMENSION (lda, k), where k is m when side = 'L' or 'l' and is n when side = 'R' or 'r'.
Before entry with uplo = 'U' or 'u', the leading k-by-k upper triangular part of the array a must contain the upper triangular matrix and the strictly lower triangular part of a is not referenced.
Before entry with uplo = 'L' or 'l', the leading k-by-k lower triangular part of the array a must contain the lower triangular matrix and the strictly upper triangular part of a is not referenced.
When diag = 'U' or 'u', the diagonal elements of a are not referenced either, but are assumed to be unity.
lda  INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When side = 'L' or 'l', then lda must be at least max(1, m); when side = 'R' or 'r', then lda must be at least max(1, n).
b  REAL for strsm; DOUBLE PRECISION for dtrsm; COMPLEX for ctrsm; DOUBLE COMPLEX for ztrsm. Array, DIMENSION (ldb,n). Before entry, the leading m-by-n part of the array b must contain the right-hand side matrix B.
ldb  INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. The value of ldb must be at least max(1, m).

Output Parameters
b  Overwritten by the solution matrix X.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trsm interface are the following:
a  Holds the matrix A of size (k,k) where k = m if side = 'L', k = n otherwise.
b  Holds the matrix B of size (m,n).
side  Must be 'L' or 'R'. The default value is 'L'.
uplo  Must be 'U' or 'L'. The default value is 'U'.
transa  Must be 'N', 'C', or 'T'. The default value is 'N'.
diag  Must be 'N' or 'U'. The default value is 'N'.
alpha  The default value is 1.

Sparse BLAS Level 1 Routines
This section describes Sparse BLAS Level 1, an extension of BLAS Level 1 included in the Intel® Math Kernel Library beginning with the Intel MKL release 2.1. Sparse BLAS Level 1 is a group of routines and functions that perform a number of common vector operations on sparse vectors stored in compressed form.
Sparse vectors are those in which the majority of elements are zeros. Sparse BLAS routines and functions are specially implemented to take advantage of vector sparsity. This allows you to achieve large savings in computer time and memory. If nz is the number of non-zero vector elements, the computer time taken by Sparse BLAS operations will be O(nz).

Vector Arguments
Compressed sparse vectors. Let a be a vector stored in an array, and assume that the only non-zero elements of a are the following:
a(k1), a(k2), a(k3), ..., a(knz),
where nz is the total number of non-zero elements in a.
In Sparse BLAS, this vector can be represented in compressed form by two FORTRAN arrays, x (values) and indx (indices). Each array has nz elements:
x(1) = a(k1), x(2) = a(k2), ..., x(nz) = a(knz),
indx(1) = k1, indx(2) = k2, ..., indx(nz) = knz.
Thus, a sparse vector is fully determined by the triple (nz, x, indx). If you pass a negative or zero value of nz to Sparse BLAS, the subroutines do not modify any arrays or variables.
Full-storage vectors. Sparse BLAS routines can also use a vector argument fully stored in a single FORTRAN array (a full-storage vector).
If y is a full-storage vector, its elements must be stored contiguously: the first element in y(1), the second in y(2), and so on. This corresponds to an increment incy = 1 in BLAS Level 1. No increment value for full-storage vectors is passed as an argument to Sparse BLAS routines or functions.

Naming Conventions
Similar to BLAS, the names of Sparse BLAS subprograms have prefixes that determine the data type involved: s and d for single- and double-precision real; c and z for single- and double-precision complex respectively.
If a Sparse BLAS routine is an extension of a "dense" one, the subprogram name is formed by appending the suffix i (standing for indexed) to the name of the corresponding "dense" subprogram. For example, the Sparse BLAS routine saxpyi corresponds to the BLAS routine saxpy, and the Sparse BLAS function cdotci corresponds to the BLAS function cdotc.

Routines and Data Types
Routines and data types supported in the Intel MKL implementation of Sparse BLAS are listed in Table “Sparse BLAS Routines and Their Data Types”.
Sparse BLAS Routines and Their Data Types
Routine/Function   Data Types    Description
?axpyi             s, d, c, z    Scalar-vector product plus vector (routines)
?doti              s, d          Dot product (functions)
?dotci             c, z          Complex dot product conjugated (functions)
?dotui             c, z          Complex dot product unconjugated (functions)
?gthr              s, d, c, z    Gathering a full-storage sparse vector into compressed form nz, x, indx (routines)
?gthrz             s, d, c, z    Gathering a full-storage sparse vector into compressed form and assigning zeros to gathered elements in the full-storage vector (routines)
?roti              s, d          Givens rotation (routines)
?sctr              s, d, c, z    Scattering a vector from compressed form to full-storage form (routines)

BLAS Level 1 Routines That Can Work With Sparse Vectors
The following BLAS Level 1 routines will give correct results when you pass to them a compressed-form array x (with the increment incx = 1):
?asum   sum of absolute values of vector elements
?copy   copying a vector
?nrm2   Euclidean norm of a vector
?scal   scaling a vector
i?amax  index of the element with the largest absolute value for real flavors, or the largest sum |Re(x(i))|+|Im(x(i))| for complex flavors.
i?amin  index of the element with the smallest absolute value for real flavors, or the smallest sum |Re(x(i))|+|Im(x(i))| for complex flavors.
The result i returned by i?amax and i?amin should be interpreted as an index in the compressed-form array, so that the largest (smallest) value is x(i); the corresponding index in the full-storage array is indx(i).
You can also call ?rotg to compute the parameters of Givens rotation and then pass these parameters to the Sparse BLAS routines ?roti.

?axpyi
Adds a scalar multiple of a compressed sparse vector to a full-storage vector.
Syntax
Fortran 77:
call saxpyi(nz, a, x, indx, y)
call daxpyi(nz, a, x, indx, y)
call caxpyi(nz, a, x, indx, y)
call zaxpyi(nz, a, x, indx, y)
Fortran 95:
call axpyi(x, indx, y [, a])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?axpyi routines perform a vector-vector operation defined as
y := a*x + y
where:
a is a scalar,
x is a sparse vector stored in compressed form,
y is a vector in full storage form.
The ?axpyi routines reference or modify only the elements of y whose indices are listed in the array indx.
The values in indx must be distinct.

Input Parameters
nz  INTEGER. The number of elements in x and indx.
a  REAL for saxpyi; DOUBLE PRECISION for daxpyi; COMPLEX for caxpyi; DOUBLE COMPLEX for zaxpyi. Specifies the scalar a.
x  REAL for saxpyi; DOUBLE PRECISION for daxpyi; COMPLEX for caxpyi; DOUBLE COMPLEX for zaxpyi. Array, DIMENSION at least nz.
indx  INTEGER. Specifies the indices for the elements of x. Array, DIMENSION at least nz.
y  REAL for saxpyi; DOUBLE PRECISION for daxpyi; COMPLEX for caxpyi; DOUBLE COMPLEX for zaxpyi. Array, DIMENSION at least max(indx(i)).

Output Parameters
y  Contains the updated vector y.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine axpyi interface are the following:
x  Holds the vector with the number of elements nz.
indx  Holds the vector with the number of elements nz.
y  Holds the vector with the number of elements nz.
a  The default value is 1.

?doti
Computes the dot product of a compressed sparse real vector with a full-storage real vector.
Syntax
Fortran 77:
res = sdoti(nz, x, indx, y)
res = ddoti(nz, x, indx, y)
Fortran 95:
res = doti(x, indx, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?doti routines return the dot product of x and y defined as
res = x(1)*y(indx(1)) + x(2)*y(indx(2)) + ... + x(nz)*y(indx(nz))
where the triple (nz, x, indx) defines a sparse real vector stored in compressed form, and y is a real vector in full storage form. The functions reference only the elements of y whose indices are listed in the array indx. The values in indx must be distinct.

Input Parameters
nz  INTEGER. The number of elements in x and indx.
x  REAL for sdoti; DOUBLE PRECISION for ddoti. Array, DIMENSION at least nz.
indx  INTEGER. Specifies the indices for the elements of x. Array, DIMENSION at least nz.
y  REAL for sdoti; DOUBLE PRECISION for ddoti. Array, DIMENSION at least max(indx(i)).

Output Parameters
res  REAL for sdoti; DOUBLE PRECISION for ddoti. Contains the dot product of x and y, if nz is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine doti interface are the following:
x  Holds the vector with the number of elements nz.
indx  Holds the vector with the number of elements nz.
y  Holds the vector with the number of elements nz.

?dotci
Computes the conjugated dot product of a compressed sparse complex vector with a full-storage complex vector.

Syntax
Fortran 77:
res = cdotci(nz, x, indx, y)
res = zdotci(nz, x, indx, y)
Fortran 95:
res = dotci(x, indx, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dotci routines return the dot product of x and y defined as
res = conjg(x(1))*y(indx(1)) + ... + conjg(x(nz))*y(indx(nz))
where the triple (nz, x, indx) defines a sparse complex vector stored in compressed form, and y is a complex vector in full storage form. The functions reference only the elements of y whose indices are listed in the array indx. The values in indx must be distinct.

Input Parameters
nz  INTEGER. The number of elements in x and indx.
x  COMPLEX for cdotci; DOUBLE COMPLEX for zdotci. Array, DIMENSION at least nz.
indx  INTEGER. Specifies the indices for the elements of x. Array, DIMENSION at least nz.
y  COMPLEX for cdotci; DOUBLE COMPLEX for zdotci. Array, DIMENSION at least max(indx(i)).

Output Parameters
res  COMPLEX for cdotci; DOUBLE COMPLEX for zdotci. Contains the conjugated dot product of x and y, if nz is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine dotci interface are the following:
x  Holds the vector with the number of elements nz.
indx  Holds the vector with the number of elements nz.
y  Holds the vector with the number of elements nz.

?dotui
Computes the unconjugated dot product of a compressed sparse complex vector with a full-storage complex vector.

Syntax
Fortran 77:
res = cdotui(nz, x, indx, y)
res = zdotui(nz, x, indx, y)
Fortran 95:
res = dotui(x, indx, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?dotui routines return the dot product of x and y defined as
res = x(1)*y(indx(1)) + x(2)*y(indx(2)) + ... + x(nz)*y(indx(nz))
where the triple (nz, x, indx) defines a sparse complex vector stored in compressed form, and y is a complex vector in full storage form. The functions reference only the elements of y whose indices are listed in the array indx.
The values in indx must be distinct.

Input Parameters
nz  INTEGER. The number of elements in x and indx.
x  COMPLEX for cdotui; DOUBLE COMPLEX for zdotui. Array, DIMENSION at least nz.
indx  INTEGER. Specifies the indices for the elements of x. Array, DIMENSION at least nz.
y  COMPLEX for cdotui; DOUBLE COMPLEX for zdotui. Array, DIMENSION at least max(indx(i)).

Output Parameters
res  COMPLEX for cdotui; DOUBLE COMPLEX for zdotui. Contains the dot product of x and y, if nz is positive. Otherwise, res contains 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine dotui interface are the following:
x  Holds the vector with the number of elements nz.
indx  Holds the vector with the number of elements nz.
y  Holds the vector with the number of elements nz.

?gthr
Gathers a full-storage sparse vector's elements into compressed form.

Syntax
Fortran 77:
call sgthr(nz, y, x, indx)
call dgthr(nz, y, x, indx)
call cgthr(nz, y, x, indx)
call zgthr(nz, y, x, indx)
Fortran 95:
call gthr(x, indx, y)

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description
The ?gthr routines gather the specified elements of a full-storage sparse vector y into compressed form (nz, x, indx). The routines reference only the elements of y whose indices are listed in the array indx:
x(i) = y(indx(i)), for i = 1, 2, ..., nz.

Input Parameters
nz  INTEGER. The number of elements of y to be gathered.
indx  INTEGER. Specifies indices of elements to be gathered. Array, DIMENSION at least nz.
y  REAL for sgthr; DOUBLE PRECISION for dgthr; COMPLEX for cgthr; DOUBLE COMPLEX for zgthr. Array, DIMENSION at least max(indx(i)).
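The gather rule x(i) = y(indx(i)) can be sketched in a few lines. This is plain Python rather than a call into Intel MKL; the names gthr_ref and compress are hypothetical and exist only for this illustration. To stay close to the manual, indx keeps the 1-based positions used by the Fortran interface, and the sketch subtracts 1 when indexing the 0-based Python list y.

```python
def gthr_ref(nz, y, indx):
    """Gather x(i) = y(indx(i)) for i = 1..nz (hypothetical helper, not MKL).

    indx holds 1-based positions; y is a 0-based Python list and is not modified.
    A zero or negative nz gathers nothing.
    """
    return [y[indx[i] - 1] for i in range(nz)]


def compress(y):
    """Build the triple (nz, x, indx) from a full-storage list y.

    Hypothetical convenience wrapper: it lists the 1-based positions of the
    non-zero elements of y and gathers them with gthr_ref.
    """
    indx = [i + 1 for i, value in enumerate(y) if value != 0]
    nz = len(indx)
    return nz, gthr_ref(nz, y, indx), indx


if __name__ == "__main__":
    y = [0.0, 3.5, 0.0, 0.0, -1.0, 0.0, 2.0]
    print(compress(y))
```

Only the elements of y named in indx are read, so the cost of the gather grows with nz rather than with the length of y.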
Output Parameters

x       REAL for sgthr
        DOUBLE PRECISION for dgthr
        COMPLEX for cgthr
        DOUBLE COMPLEX for zgthr
        Array, DIMENSION at least nz.
        Contains the vector converted to the compressed form.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gthr interface are the following:

x       Holds the vector with the number of elements nz.
indx    Holds the vector with the number of elements nz.
y       Holds the vector with the number of elements nz.

?gthrz

Gathers a sparse vector's elements into compressed form, replacing them by zeros.

Syntax

Fortran 77:
    call sgthrz(nz, y, x, indx)
    call dgthrz(nz, y, x, indx)
    call cgthrz(nz, y, x, indx)
    call zgthrz(nz, y, x, indx)
Fortran 95:
    call gthrz(x, indx, y)

Include Files

• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?gthrz routines gather the elements with indices specified by the array indx from a full-storage vector y into compressed form (nz, x, indx) and overwrite the gathered elements of y with zeros. Other elements of y are not referenced or modified (see also ?gthr).

Input Parameters

nz      INTEGER. The number of elements of y to be gathered.
indx    INTEGER. Specifies indices of elements to be gathered.
        Array, DIMENSION at least nz.
y       REAL for sgthrz
        DOUBLE PRECISION for dgthrz
        COMPLEX for cgthrz
        DOUBLE COMPLEX for zgthrz
        Array, DIMENSION at least max(indx(i)).

Output Parameters

x       REAL for sgthrz
        DOUBLE PRECISION for dgthrz
        COMPLEX for cgthrz
        DOUBLE COMPLEX for zgthrz
        Array, DIMENSION at least nz.
        Contains the vector converted to the compressed form.
y       The updated vector y.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gthrz interface are the following:

x       Holds the vector with the number of elements nz.
indx    Holds the vector with the number of elements nz.
y       Holds the vector with the number of elements nz.

?roti

Applies Givens rotation to sparse vectors one of which is in compressed form.

Syntax

Fortran 77:
    call sroti(nz, x, indx, y, c, s)
    call droti(nz, x, indx, y, c, s)
Fortran 95:
    call roti(x, indx, y, c, s)

Include Files

• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?roti routines apply the Givens rotation to elements of two real vectors, x (in compressed form nz, x, indx) and y (in full storage form):

    x(i) = c*x(i) + s*y(indx(i))
    y(indx(i)) = c*y(indx(i)) - s*x(i)

The routines reference only the elements of y whose indices are listed in the array indx. The values in indx must be distinct.

Input Parameters

nz      INTEGER. The number of elements in x and indx.
x       REAL for sroti
        DOUBLE PRECISION for droti
        Array, DIMENSION at least nz.
indx    INTEGER. Specifies the indices for the elements of x.
        Array, DIMENSION at least nz.
y       REAL for sroti
        DOUBLE PRECISION for droti
        Array, DIMENSION at least max(indx(i)).
c       A scalar: REAL for sroti
        DOUBLE PRECISION for droti.
s       A scalar: REAL for sroti
        DOUBLE PRECISION for droti.

Output Parameters

x and y The updated arrays.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine roti interface are the following:

x       Holds the vector with the number of elements nz.
indx    Holds the vector with the number of elements nz.
y       Holds the vector with the number of elements nz.
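The ?roti update can be illustrated with a plain reference loop. The sketch below is not the Intel MKL implementation; the function name ref_droti is hypothetical, and it assumes that both assignments use the values of x(i) and y(indx(i)) as they were before the rotation, and that indx holds one-based indices as in the Fortran interface.

```c
#include <assert.h>

/* Reference sketch of the sparse Givens rotation (hypothetical helper,
 * not an MKL routine).
 * x    : compressed vector, nz elements (updated in place)
 * indx : one-based indices into y, nz distinct values
 * y    : full-storage vector (only y[indx[i]-1] is touched) */
void ref_droti(int nz, double *x, const int *indx, double *y,
               double c, double s)
{
    for (int i = 0; i < nz; ++i) {
        assert(indx[i] >= 1);      /* one-based indexing is assumed */
        int k = indx[i] - 1;       /* convert to a C array offset */
        double xi = x[i];          /* keep the old values so both */
        double yk = y[k];          /* updates use the same inputs */
        x[i] = c * xi + s * yk;
        y[k] = c * yk - s * xi;
    }
}
```

Elements of y whose indices do not appear in indx are left untouched, which matches the statement that only the listed elements are referenced.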
?sctr

Converts compressed sparse vectors into full storage form.

Syntax

Fortran 77:
    call ssctr(nz, x, indx, y)
    call dsctr(nz, x, indx, y)
    call csctr(nz, x, indx, y)
    call zsctr(nz, x, indx, y)
Fortran 95:
    call sctr(x, indx, y)

Include Files

• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?sctr routines scatter the elements of the compressed sparse vector (nz, x, indx) to a full-storage vector y. The routines modify only the elements of y whose indices are listed in the array indx:

    y(indx(i)) = x(i), for i = 1, 2, ..., nz.

Input Parameters

nz      INTEGER. The number of elements of x to be scattered.
indx    INTEGER. Specifies indices of elements to be scattered.
        Array, DIMENSION at least nz.
x       REAL for ssctr
        DOUBLE PRECISION for dsctr
        COMPLEX for csctr
        DOUBLE COMPLEX for zsctr
        Array, DIMENSION at least nz.
        Contains the vector to be converted to full-storage form.

Output Parameters

y       REAL for ssctr
        DOUBLE PRECISION for dsctr
        COMPLEX for csctr
        DOUBLE COMPLEX for zsctr
        Array, DIMENSION at least max(indx(i)).
        Contains the vector y with updated elements.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sctr interface are the following:

x       Holds the vector with the number of elements nz.
indx    Holds the vector with the number of elements nz.
y       Holds the vector with the number of elements nz.

Sparse BLAS Level 2 and Level 3 Routines

This section describes Sparse BLAS Level 2 and Level 3 routines included in the Intel® Math Kernel Library (Intel® MKL). Sparse BLAS Level 2 is a group of routines and functions that perform operations between a sparse matrix and dense vectors.
Sparse BLAS Level 3 is a group of routines and functions that perform operations between a sparse matrix and dense matrices.

The terms and concepts required to understand the use of the Intel MKL Sparse BLAS Level 2 and Level 3 routines are discussed in the Linear Solvers Basics appendix.

The Sparse BLAS routines can be useful for implementing iterative methods for solving large sparse systems of equations or eigenvalue problems. For example, these routines can be considered as building blocks for Iterative Sparse Solvers based on Reverse Communication Interface (RCI ISS) described in Chapter 8 of the manual.

Intel MKL provides Sparse BLAS Level 2 and Level 3 routines with a typical (or conventional) interface similar to the interface used in the NIST* Sparse BLAS library [Rem05].

Some software packages and libraries (the PARDISO* Solver used in Intel MKL, Sparskit 2 [Saad94], the Compaq* Extended Math Library (CXML) [CXML01]) use a different (early) variation of the compressed sparse row (CSR) format and support only Level 2 operations with simplified interfaces. Intel MKL provides an additional set of Sparse BLAS Level 2 routines with similar simplified interfaces. Each of these routines operates only on a matrix of a fixed type.

The routines described in this section support both one-based indexing and zero-based indexing of the input data (see details in the section One-based and Zero-based Indexing).

Naming Conventions in Sparse BLAS Level 2 and Level 3

Each Sparse BLAS Level 2 and Level 3 routine has a six- or eight-character base name preceded by the prefix mkl_ or mkl_cspblas_.

The routines with typical (conventional) interface have six-character base names in accordance with the template:

    mkl_<character><data><operation>( )

The routines with simplified interfaces have eight-character base names in accordance with the templates:

    mkl_<character><data><mtype><operation>( )

for routines with one-based indexing; and

    mkl_cspblas_<character><data><mtype><operation>( )

for routines with zero-based indexing.
The <character> field indicates the data type:

s       real, single precision
c       complex, single precision
d       real, double precision
z       complex, double precision

The <data> field indicates the sparse matrix storage format (see section Sparse Matrix Storage Formats):

coo     coordinate format
csr     compressed sparse row format and its variations
csc     compressed sparse column format and its variations
dia     diagonal format
sky     skyline storage format
bsr     block sparse row format and its variations

The <operation> field indicates the type of operation:

mv      matrix-vector product (Level 2)
mm      matrix-matrix product (Level 3)
sv      solving a single triangular system (Level 2)
sm      solving triangular systems with multiple right-hand sides (Level 3)

The <mtype> field indicates the matrix type:

ge      sparse representation of a general matrix
sy      sparse representation of the upper or lower triangle of a symmetric matrix
tr      sparse representation of a triangular matrix

Sparse Matrix Storage Formats

The current version of Intel MKL Sparse BLAS Level 2 and Level 3 routines supports the following point entry [Duff86] storage formats for sparse matrices:

• compressed sparse row format (CSR) and its variations;
• compressed sparse column format (CSC);
• coordinate format;
• diagonal format;
• skyline storage format;

and one block entry storage format:

• block sparse row format (BSR) and its variations.

For more information see "Sparse Matrix Storage Formats" in Appendix A.

Intel MKL provides auxiliary routines (matrix converters) that convert a sparse matrix from one storage format to another.

Routines and Supported Operations

This section describes operations supported by the Intel MKL Sparse BLAS Level 2 and Level 3 routines.
The following notations are used here:

A is a sparse matrix;
B and C are dense matrices;
D is a diagonal scaling matrix;
x and y are dense vectors;
alpha and beta are scalars;
op(A) is one of the possible operations:
    op(A) = A;
    op(A) = A' - transpose of A;
    op(A) = conj(A') - conjugated transpose of A.
inv(op(A)) denotes the inverse of op(A).

The Intel MKL Sparse BLAS Level 2 and Level 3 routines support the following operations:

• computing the vector product between a sparse matrix and a dense vector:
    y := alpha*op(A)*x + beta*y
• solving a single triangular system:
    y := alpha*inv(op(A))*x
• computing a product between sparse matrix and dense matrix:
    C := alpha*op(A)*B + beta*C
• solving a sparse triangular system with multiple right-hand sides:
    C := alpha*inv(op(A))*B

Intel MKL provides an additional set of the Sparse BLAS Level 2 routines with simplified interfaces. Each of these routines operates on a matrix of a fixed type. The following operations are supported:

• computing the vector product between a sparse matrix and a dense vector (for general and symmetric matrices):
    y := op(A)*x
• solving a single triangular system (for triangular matrices):
    y := inv(op(A))*x

Matrix type is indicated by the <mtype> field in the routine name (see section Naming Conventions in Sparse BLAS Level 2 and Level 3).

NOTE
The routines with simplified interfaces support only four sparse matrix storage formats, specifically:
    CSR format in the 3-array variation accepted in the direct sparse solvers and in the CXML;
    diagonal format accepted in the CXML;
    coordinate format;
    BSR format in the 3-array variation.

Note that routines with both typical (conventional) and simplified interfaces use the same computational kernels that work with certain internal data structures.

The Intel MKL Sparse BLAS Level 2 and Level 3 routines do not support in-place operations.
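As an illustration of the triangular-system operation y := alpha*inv(op(A))*x for op(A) = A, the result can be obtained by forward substitution without ever forming an inverse. The following is a minimal sketch under stated assumptions, not the Intel MKL implementation: the helper name ref_csr_lower_solve is hypothetical, the matrix is lower triangular with a non-unit diagonal, it is stored in the 3-array CSR variation with zero-based indexing, and every row stores its diagonal entry. Separate input and output arrays are required, in line with the restriction on in-place operations.

```c
#include <assert.h>

/* Sketch: solve L*y = alpha*x for a lower triangular L in zero-based
 * 3-array CSR (val, ia, ja). Hypothetical helper, not an MKL routine.
 * Returns 0 on success, -1 if a row has no non-zero diagonal entry. */
int ref_csr_lower_solve(int m, double alpha, const double *val,
                        const int *ia, const int *ja,
                        const double *x, double *y)
{
    assert(x != y);                    /* in-place use is not supported */
    for (int i = 0; i < m; ++i) {
        double sum = alpha * x[i];
        double diag = 0.0;
        for (int p = ia[i]; p < ia[i + 1]; ++p) {
            int j = ja[p];
            if (j < i)
                sum -= val[p] * y[j];  /* already computed unknowns */
            else if (j == i)
                diag = val[p];         /* diagonal entry of row i */
            /* entries with j > i lie above the diagonal and are skipped */
        }
        if (diag == 0.0)
            return -1;
        y[i] = sum / diag;
    }
    return 0;
}
```

Skipping the entries above the diagonal mirrors the idea that a triangular solver works only with the relevant triangle of the stored matrix.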
A complete list of all routines is given in the table "Sparse BLAS Level 2 and Level 3 Routines".

Interface Consideration

One-Based and Zero-Based Indexing

The Intel MKL Sparse BLAS Level 2 and Level 3 routines support one-based and zero-based indexing of data arrays.

Routines with typical interfaces support zero-based indexing for the following sparse data storage formats: CSR, CSC, BSR, and COO. Routines with simplified interfaces support zero-based indexing for the following sparse data storage formats: CSR, BSR, and COO. See the complete list of Sparse BLAS Level 2 and Level 3 Routines.

The one-based indexing uses the convention of starting array indices at 1. The zero-based indexing uses the convention of starting array indices at 0. For example, indices of the 5-element array x can be presented in case of one-based indexing as follows:

    Element index:   1    2    3    4    5
    Element value:   1.0  5.0  7.0  8.0  9.0

and in case of zero-based indexing as follows:

    Element index:   0    1    2    3    4
    Element value:   1.0  5.0  7.0  8.0  9.0

The detailed descriptions of the one-based and zero-based variants of the sparse data storage formats are given in the "Sparse Matrix Storage Formats" in Appendix A.

Most parameters of the routines are identical for both one-based and zero-based indexing, but some of them have certain differences. The following table lists all these differences.

Parameter: val
    One-based indexing:  Array containing non-zero elements of the matrix A, its length is pntre(m) - pntrb(1).
    Zero-based indexing: Array containing non-zero elements of the matrix A, its length is pntre(m-1) - pntrb(0).

Parameter: pntrb
    One-based indexing:  Array of length m. This array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of row i in the arrays val and indx.
    Zero-based indexing: Array of length m. This array contains row indices, such that pntrb(i) - pntrb(0) is the first index of row i in the arrays val and indx.

Parameter: pntre
    One-based indexing:  Array of length m. This array contains row indices, such that pntre(i) - pntrb(1) is the last index of row i in the arrays val and indx.
    Zero-based indexing: Array of length m. This array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of row i in the arrays val and indx.

Parameter: ia
    One-based indexing:  Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one.
    Zero-based indexing: Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m) is equal to the number of non-zeros.

Parameter: ldb
    One-based indexing:  Specifies the leading dimension of b as declared in the calling (sub)program.
    Zero-based indexing: Specifies the second dimension of b as declared in the calling (sub)program.

Parameter: ldc
    One-based indexing:  Specifies the leading dimension of c as declared in the calling (sub)program.
    Zero-based indexing: Specifies the second dimension of c as declared in the calling (sub)program.

Difference Between Fortran and C Interfaces

Intel MKL provides both Fortran and C interfaces to all Sparse BLAS Level 2 and Level 3 routines. Parameter descriptions are common for both interfaces with the exception of data types that refer to the FORTRAN 77 standard types. Correspondence between data types specific to the Fortran and C interfaces is given below:

    Fortran        C
    REAL*4         float
    REAL*8         double
    INTEGER*4      int
    INTEGER*8      long long int
    CHARACTER      char

For routines with C interfaces all parameters (including scalars) must be passed by reference.

Another difference is how two-dimensional arrays are represented. Fortran uses column-major order, and C uses row-major order. This changes the meaning of the parameters ldb and ldc (see the table above).
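The relationship between the one-based and zero-based variants of the ia and ja arrays can be sketched as a simple index shift. The helper below is a hypothetical illustration, not an Intel MKL routine; it assumes the 3-array CSR variation with an m-row matrix and converts the index arrays in place.

```c
#include <assert.h>

/* Sketch: shift 3-array CSR index arrays from one-based to zero-based.
 * Hypothetical helper. ia has m + 1 entries; ja has ia[m] - 1 entries
 * (the number of non-zeros) while still in one-based form. */
void csr_one_to_zero_based(int m, int *ia, int *ja)
{
    assert(ia[0] == 1);                /* input must be one-based */
    int nnz = ia[m] - 1;               /* last entry is nnz plus one */
    for (int p = 0; p < nnz; ++p)
        ja[p] -= 1;                    /* column indices start at 0 */
    for (int i = 0; i <= m; ++i)
        ia[i] -= 1;                    /* row pointers start at 0 */
}
```

After the shift, the last entry of ia equals the number of non-zeros rather than that number plus one, which is the difference recorded for ia in the table above.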
Differences Between Intel MKL and NIST* Interfaces

The Intel MKL Sparse BLAS Level 3 routines have the following conventional interfaces:

    mkl_xyyymm(transa, m, n, k, alpha, matdescra, arg(A), b, ldb, beta, c, ldc), for matrix-matrix product;
    mkl_xyyysm(transa, m, n, alpha, matdescra, arg(A), b, ldb, c, ldc), for triangular solvers with multiple right-hand sides.

Here x denotes the data type, and yyy denotes the sparse matrix data structure (storage format).

The analogous NIST* Sparse BLAS (NSB) library routines have the following interfaces:

    xyyymm(transa, m, n, k, alpha, descra, arg(A), b, ldb, beta, c, ldc, work, lwork), for matrix-matrix product;
    xyyysm(transa, m, n, unitd, dv, alpha, descra, arg(A), b, ldb, beta, c, ldc, work, lwork), for triangular solvers with multiple right-hand sides.

Some similar arguments are used in both libraries. The argument transa indicates what operation is performed and is slightly different in the NSB library (see Table "Parameter transa"). The arguments m and k are the number of rows and columns in the matrix A, respectively, and n is the number of columns in the matrix C. The arguments alpha and beta are the scalars alpha and beta, respectively (beta is not used in the Intel MKL triangular solvers). The arguments b and c are rectangular arrays with the leading dimensions ldb and ldc, respectively. arg(A) denotes the list of arguments that describe the sparse representation of A.

Parameter transa

                 MKL interface    NSB interface    Operation
    data type    CHARACTER*1      INTEGER
    value        N or n           0                op(A) = A
                 T or t           1                op(A) = A'
                 C or c           2                op(A) = A'

Parameter matdescra

The parameter matdescra describes the relevant characteristic of the matrix A. This manual describes matdescra as an array of six elements in line with the NIST* implementation. However, only the first four elements of the array are used in the current versions of the Intel MKL Sparse BLAS routines.
Elements matdescra(5) and matdescra(6) are reserved for future use. Note that whether matdescra is described in your application as an array of length 6 or 4 is of no importance because the array is declared as a pointer in the Intel MKL routines. To learn more about declaration of the matdescra array, see Sparse BLAS examples located in the following subdirectory of the Intel MKL installation directory: examples/spblas/. The table below lists elements of the parameter matdescra, their values and meanings. The parameter matdescra corresponds to the argument descra from NSB library.

Possible Values of the Parameter matdescra (descra)

                   MKL interface                       NSB interface    Matrix characteristics
                   one-based         zero-based
                   indexing          indexing
    data type      CHARACTER         char              INTEGER
    1st element    matdescra(1)      matdescra(0)      descra(1)        matrix structure
      value        G                 G                 0                general
                   S                 S                 1                symmetric (A = A')
                   H                 H                 2                Hermitian (A = conjg(A'))
                   T                 T                 3                triangular
                   A                 A                 4                skew(anti)-symmetric (A = -A')
                   D                 D                 5                diagonal
    2nd element    matdescra(2)      matdescra(1)      descra(2)        upper/lower triangular indicator
      value        L                 L                 1                lower
                   U                 U                 2                upper
    3rd element    matdescra(3)      matdescra(2)      descra(3)        main diagonal type
      value        N                 N                 0                non-unit
                   U                 U                 1                unit
    4th element    matdescra(4)      matdescra(3)                       type of indexing
      value        F                                                    one-based indexing
                                     C                                  zero-based indexing

In some cases possible element values of the parameter matdescra depend on the values of other elements. The Table "Possible Combinations of Element Values of the Parameter matdescra" lists all possible combinations of element values for both multiplication routines and triangular solvers.
Possible Combinations of Element Values of the Parameter matdescra

    Routines                      matdescra(1)   matdescra(2)   matdescra(3)   matdescra(4)
    Multiplication Routines       G              ignored        ignored        F (default) or C
                                  S or H         L (default)    N (default)    F (default) or C
                                  S or H         L (default)    U              F (default) or C
                                  S or H         U              N (default)    F (default) or C
                                  S or H         U              U              F (default) or C
                                  A              L (default)    ignored        F (default) or C
                                  A              U              ignored        F (default) or C
    Multiplication Routines       T              L              U              F (default) or C
    and Triangular Solvers        T              L              N              F (default) or C
                                  T              U              U              F (default) or C
                                  T              U              N              F (default) or C
                                  D              ignored        N (default)    F (default) or C
                                  D              ignored        U              F (default) or C

For a matrix in the skyline format with the main diagonal declared to be a unit, diagonal elements must be stored in the sparse representation even if they are zero. In all other formats, diagonal elements can be stored (if needed) in the sparse representation if they are not zero.

Operations with Partial Matrices

One of the distinctive features of the Intel MKL Sparse BLAS routines is the possibility to perform operations only on partial matrices composed of certain parts (triangles and the main diagonal) of the input sparse matrix. This is done by setting the first three elements of the parameter matdescra appropriately.

An arbitrary sparse matrix A can be decomposed as

    A = L + D + U

where L is the strict lower triangle of A, U is the strict upper triangle of A, and D is the main diagonal.

Table "Output Matrices for Multiplication Routines" shows correspondence between the output matrices and values of the parameter matdescra for the sparse matrix A for multiplication routines.
Output Matrices for Multiplication Routines

    matdescra(1)   matdescra(2)   matdescra(3)   Output Matrix
    G              ignored        ignored        alpha*op(A)*x + beta*y
                                                 alpha*op(A)*B + beta*C
    S or H         L              N              alpha*op(L+D+L')*x + beta*y
                                                 alpha*op(L+D+L')*B + beta*C
    S or H         L              U              alpha*op(L+I+L')*x + beta*y
                                                 alpha*op(L+I+L')*B + beta*C
    S or H         U              N              alpha*op(U'+D+U)*x + beta*y
                                                 alpha*op(U'+D+U)*B + beta*C
    S or H         U              U              alpha*op(U'+I+U)*x + beta*y
                                                 alpha*op(U'+I+U)*B + beta*C
    T              L              U              alpha*op(L+I)*x + beta*y
                                                 alpha*op(L+I)*B + beta*C
    T              L              N              alpha*op(L+D)*x + beta*y
                                                 alpha*op(L+D)*B + beta*C
    T              U              U              alpha*op(U+I)*x + beta*y
                                                 alpha*op(U+I)*B + beta*C
    T              U              N              alpha*op(U+D)*x + beta*y
                                                 alpha*op(U+D)*B + beta*C
    A              L              ignored        alpha*op(L-L')*x + beta*y
                                                 alpha*op(L-L')*B + beta*C
    A              U              ignored        alpha*op(U-U')*x + beta*y
                                                 alpha*op(U-U')*B + beta*C
    D              ignored        N              alpha*D*x + beta*y
                                                 alpha*D*B + beta*C
    D              ignored        U              alpha*x + beta*y
                                                 alpha*B + beta*C

Table "Output Matrices for Triangular Solvers" shows correspondence between the output matrices and values of the parameter matdescra for the sparse matrix A for triangular solvers.

Output Matrices for Triangular Solvers

    matdescra(1)   matdescra(2)   matdescra(3)   Output Matrix
    T              L              N              alpha*inv(op(L+D))*x
                                                 alpha*inv(op(L+D))*B
    T              L              U              alpha*inv(op(L+I))*x
                                                 alpha*inv(op(L+I))*B
    T              U              N              alpha*inv(op(U+D))*x
                                                 alpha*inv(op(U+D))*B
    T              U              U              alpha*inv(op(U+I))*x
                                                 alpha*inv(op(U+I))*B
    D              ignored        N              alpha*inv(D)*x
                                                 alpha*inv(D)*B
    D              ignored        U              alpha*x
                                                 alpha*B

Sparse BLAS Level 2 and Level 3 Routines

Table "Sparse BLAS Level 2 and Level 3 Routines" lists the Sparse BLAS Level 2 and Level 3 routines described in more detail later in this section.

Sparse BLAS Level 2 and Level 3 Routines

    Routine/Function        Description

    Simplified interface, one-based indexing
    mkl_?csrgemv            Computes matrix-vector product of a sparse general matrix in the CSR format (3-array variation).
    mkl_?bsrgemv            Computes matrix-vector product of a sparse general matrix in the BSR format (3-array variation).
    mkl_?coogemv            Computes matrix-vector product of a sparse general matrix in the coordinate format.
    mkl_?diagemv            Computes matrix-vector product of a sparse general matrix in the diagonal format.
    mkl_?csrsymv            Computes matrix-vector product of a sparse symmetrical matrix in the CSR format (3-array variation).
    mkl_?bsrsymv            Computes matrix-vector product of a sparse symmetrical matrix in the BSR format (3-array variation).
    mkl_?coosymv            Computes matrix-vector product of a sparse symmetrical matrix in the coordinate format.
    mkl_?diasymv            Computes matrix-vector product of a sparse symmetrical matrix in the diagonal format.
    mkl_?csrtrsv            Triangular solver with simplified interface for a sparse matrix in the CSR format (3-array variation).
    mkl_?bsrtrsv            Triangular solver with simplified interface for a sparse matrix in the BSR format (3-array variation).
    mkl_?cootrsv            Triangular solver with simplified interface for a sparse matrix in the coordinate format.
    mkl_?diatrsv            Triangular solver with simplified interface for a sparse matrix in the diagonal format.

    Simplified interface, zero-based indexing
    mkl_cspblas_?csrgemv    Computes matrix-vector product of a sparse general matrix in the CSR format (3-array variation) with zero-based indexing.
    mkl_cspblas_?bsrgemv    Computes matrix-vector product of a sparse general matrix in the BSR format (3-array variation) with zero-based indexing.
    mkl_cspblas_?coogemv    Computes matrix-vector product of a sparse general matrix in the coordinate format with zero-based indexing.
    mkl_cspblas_?csrsymv    Computes matrix-vector product of a sparse symmetrical matrix in the CSR format (3-array variation) with zero-based indexing.
    mkl_cspblas_?bsrsymv    Computes matrix-vector product of a sparse symmetrical matrix in the BSR format (3-array variation) with zero-based indexing.
    mkl_cspblas_?coosymv    Computes matrix-vector product of a sparse symmetrical matrix in the coordinate format with zero-based indexing.
    mkl_cspblas_?csrtrsv    Triangular solver with simplified interface for a sparse matrix in the CSR format (3-array variation) with zero-based indexing.
    mkl_cspblas_?bsrtrsv    Triangular solver with simplified interface for a sparse matrix in the BSR format (3-array variation) with zero-based indexing.
    mkl_cspblas_?cootrsv    Triangular solver with simplified interface for a sparse matrix in the coordinate format with zero-based indexing.

    Typical (conventional) interface, one-based and zero-based indexing
    mkl_?csrmv              Computes matrix-vector product of a sparse matrix in the CSR format.
    mkl_?bsrmv              Computes matrix-vector product of a sparse matrix in the BSR format.
    mkl_?cscmv              Computes matrix-vector product for a sparse matrix in the CSC format.
    mkl_?coomv              Computes matrix-vector product for a sparse matrix in the coordinate format.
    mkl_?csrsv              Solves a system of linear equations for a sparse matrix in the CSR format.
    mkl_?bsrsv              Solves a system of linear equations for a sparse matrix in the BSR format.
    mkl_?cscsv              Solves a system of linear equations for a sparse matrix in the CSC format.
    mkl_?coosv              Solves a system of linear equations for a sparse matrix in the coordinate format.
    mkl_?csrmm              Computes matrix-matrix product of a sparse matrix in the CSR format.
    mkl_?bsrmm              Computes matrix-matrix product of a sparse matrix in the BSR format.
    mkl_?cscmm              Computes matrix-matrix product of a sparse matrix in the CSC format.
    mkl_?coomm              Computes matrix-matrix product of a sparse matrix in the coordinate format.
    mkl_?csrsm              Solves a system of linear matrix equations for a sparse matrix in the CSR format.
    mkl_?bsrsm              Solves a system of linear matrix equations for a sparse matrix in the BSR format.
    mkl_?cscsm              Solves a system of linear matrix equations for a sparse matrix in the CSC format.
    mkl_?coosm              Solves a system of linear matrix equations for a sparse matrix in the coordinate format.

    Typical (conventional) interface, one-based indexing
    mkl_?diamv              Computes matrix-vector product of a sparse matrix in the diagonal format.
    mkl_?skymv              Computes matrix-vector product for a sparse matrix in the skyline storage format.
    mkl_?diasv              Solves a system of linear equations for a sparse matrix in the diagonal format.
    mkl_?skysv              Solves a system of linear equations for a sparse matrix in the skyline format.
    mkl_?diamm              Computes matrix-matrix product of a sparse matrix in the diagonal format.
    mkl_?skymm              Computes matrix-matrix product of a sparse matrix in the skyline storage format.
    mkl_?diasm              Solves a system of linear matrix equations for a sparse matrix in the diagonal format.
    mkl_?skysm              Solves a system of linear matrix equations for a sparse matrix in the skyline storage format.

    Auxiliary routines

    Matrix converters
    mkl_?dnscsr             Converts a sparse matrix in the dense representation to the CSR format (3-array variation).
    mkl_?csrcoo             Converts a sparse matrix in the CSR format (3-array variation) to the coordinate format and vice versa.
    mkl_?csrbsr             Converts a sparse matrix in the CSR format to the BSR format (3-array variations) and vice versa.
    mkl_?csrcsc             Converts a sparse matrix in the CSR format to the CSC format and vice versa (3-array variations).
    mkl_?csrdia             Converts a sparse matrix in the CSR format (3-array variation) to the diagonal format and vice versa.
    mkl_?csrsky             Converts a sparse matrix in the CSR format (3-array variation) to the skyline format and vice versa.

    Operations on sparse matrices
    mkl_?csradd             Computes the sum of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing.
    mkl_?csrmultcsr         Computes the product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing.
    mkl_?csrmultd           Computes the product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing. The result is stored in the dense matrix.

mkl_?csrgemv

Computes matrix-vector product of a sparse general matrix stored in the CSR format (3-array variation) with one-based indexing.

Syntax

Fortran:
    call mkl_scsrgemv(transa, m, a, ia, ja, x, y)
    call mkl_dcsrgemv(transa, m, a, ia, ja, x, y)
    call mkl_ccsrgemv(transa, m, a, ia, ja, x, y)
    call mkl_zcsrgemv(transa, m, a, ia, ja, x, y)
C:
    mkl_scsrgemv(&transa, &m, a, ia, ja, x, y);
    mkl_dcsrgemv(&transa, &m, a, ia, ja, x, y);
    mkl_ccsrgemv(&transa, &m, a, ia, ja, x, y);
    mkl_zcsrgemv(&transa, &m, a, ia, ja, x, y);

Include Files

• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?csrgemv routine performs a matrix-vector operation defined as

    y := A*x

or

    y := A'*x,

where:
x and y are vectors,
A is an m-by-m sparse square matrix in the CSR format (3-array variation),
A' is the transpose of A.

NOTE
This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa  CHARACTER*1. Specifies the operation.
        If transa = 'N' or 'n', then y := A*x
        If transa = 'T' or 't' or 'C' or 'c', then y := A'*x.
m       INTEGER. Number of rows of the matrix A.
a       REAL for mkl_scsrgemv.
        DOUBLE PRECISION for mkl_dcsrgemv.
        COMPLEX for mkl_ccsrgemv.
        DOUBLE COMPLEX for mkl_zcsrgemv.
        Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
ia      INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
ja      INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
x       REAL for mkl_scsrgemv.
        DOUBLE PRECISION for mkl_dcsrgemv.
        COMPLEX for mkl_ccsrgemv.
        DOUBLE COMPLEX for mkl_zcsrgemv.
        Array, DIMENSION is m.
        On entry, the array x must contain the vector x.

Output Parameters

y       REAL for mkl_scsrgemv.
        DOUBLE PRECISION for mkl_dcsrgemv.
        COMPLEX for mkl_ccsrgemv.
        DOUBLE COMPLEX for mkl_zcsrgemv.
        Array, DIMENSION at least m.
        On exit, the array y contains the vector y.
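To make the roles of a, ia, and ja concrete, the operation described above can be written as a plain reference loop. This is a sketch for real double-precision data only, not the Intel MKL implementation; the function name ref_dcsrgemv and the integer trans flag (0 for y := A*x, non-zero for y := A'*x) are assumptions made for this illustration.

```c
#include <assert.h>

/* Reference sketch of y := A*x or y := A'*x for an m-by-m matrix in
 * one-based 3-array CSR (a, ia, ja). Hypothetical helper, not an MKL
 * routine. x and y must be distinct arrays of length m. */
void ref_dcsrgemv(int trans, int m, const double *a, const int *ia,
                  const int *ja, const double *x, double *y)
{
    assert(ia[0] == 1);                    /* one-based row pointers */
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;
    for (int i = 0; i < m; ++i) {
        /* ia[i]-1 .. ia[i+1]-2 are the C offsets of row i in a and ja */
        for (int p = ia[i] - 1; p < ia[i + 1] - 1; ++p) {
            int j = ja[p] - 1;             /* one-based column -> offset */
            if (trans)
                y[j] += a[p] * x[i];       /* A'(j,i) = A(i,j) */
            else
                y[i] += a[p] * x[j];
        }
    }
}
```

The transposed product reuses the same row-wise traversal and only swaps which vector index is read and which is accumulated, so no transposed copy of A is built.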
Interfaces

FORTRAN 77:

    SUBROUTINE mkl_scsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    REAL a(*), x(*), y(*)

    SUBROUTINE mkl_dcsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    DOUBLE PRECISION a(*), x(*), y(*)

    SUBROUTINE mkl_ccsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    COMPLEX a(*), x(*), y(*)

    SUBROUTINE mkl_zcsrgemv(transa, m, a, ia, ja, x, y)
    CHARACTER*1 transa
    INTEGER m
    INTEGER ia(*), ja(*)
    DOUBLE COMPLEX a(*), x(*), y(*)

C:

    void mkl_scsrgemv(char *transa, int *m, float *a, int *ia, int *ja,
                      float *x, float *y);
    void mkl_dcsrgemv(char *transa, int *m, double *a, int *ia, int *ja,
                      double *x, double *y);
    void mkl_ccsrgemv(char *transa, int *m, MKL_Complex8 *a, int *ia, int *ja,
                      MKL_Complex8 *x, MKL_Complex8 *y);
    void mkl_zcsrgemv(char *transa, int *m, MKL_Complex16 *a, int *ia, int *ja,
                      MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?bsrgemv

Computes matrix-vector product of a sparse general matrix stored in the BSR format (3-array variation) with one-based indexing.

Syntax

Fortran:
    call mkl_sbsrgemv(transa, m, lb, a, ia, ja, x, y)
    call mkl_dbsrgemv(transa, m, lb, a, ia, ja, x, y)
    call mkl_cbsrgemv(transa, m, lb, a, ia, ja, x, y)
    call mkl_zbsrgemv(transa, m, lb, a, ia, ja, x, y)
C:
    mkl_sbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
    mkl_dbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
    mkl_cbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
    mkl_zbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);

Include Files

• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?bsrgemv routine performs a matrix-vector operation defined as

    y := A*x

or

    y := A'*x,

where:
x and y are vectors,
A is an m-by-m block sparse square matrix in the BSR format (3-array variation),
A' is the transpose of A.

NOTE
This routine supports only one-based indexing of the input arrays.
Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation.
If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x.
If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x.

m INTEGER. Number of block rows of the matrix A.

lb INTEGER. Size of the block in the matrix A.

a REAL for mkl_sbsrgemv. DOUBLE PRECISION for mkl_dbsrgemv. COMPLEX for mkl_cbsrgemv. DOUBLE COMPLEX for mkl_zbsrgemv.
Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.

ia INTEGER. Array of length (m + 1), containing indices of blocks in the array a, such that ia(i) is the index in the array a of the first non-zero block from the block row i. The value of the last element ia(m + 1) is equal to the number of non-zero blocks plus one. Refer to rowIndex array description in BSR Format for more details.

ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.

x REAL for mkl_sbsrgemv. DOUBLE PRECISION for mkl_dbsrgemv. COMPLEX for mkl_cbsrgemv. DOUBLE COMPLEX for mkl_zbsrgemv.
Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.

Output Parameters

y REAL for mkl_sbsrgemv. DOUBLE PRECISION for mkl_dbsrgemv. COMPLEX for mkl_cbsrgemv. DOUBLE COMPLEX for mkl_zbsrgemv.
Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y.
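The block-level indexing described above can be sketched in plain C for the non-transposed case. This is not the library's implementation: ref_dbsrgemv_n is a hypothetical name, scalars are passed by value for readability, and the sketch assumes that each lb-by-lb block is stored column by column (Fortran order) and that ia and ja count whole blocks; check the BSR Format description before relying on either assumption.

```c
#include <assert.h>

/* Reference sketch (not the MKL implementation): y := A*x for a block
 * sparse matrix with m block rows and lb-by-lb blocks, one-based indices.
 * Assumption: the elements of each block are stored column by column. */
static void ref_dbsrgemv_n(int m, int lb, const double *a,
                           const int *ia, const int *ja,
                           const double *x, double *y)
{
    assert(m >= 0 && lb > 0);
    for (int i = 0; i < m * lb; ++i)
        y[i] = 0.0;

    for (int ib = 0; ib < m; ++ib) {                  /* block row */
        for (int k = ia[ib] - 1; k < ia[ib + 1] - 1; ++k) {
            int jb = ja[k] - 1;                       /* block column */
            const double *blk = a + k * lb * lb;      /* k-th stored block */
            for (int c = 0; c < lb; ++c)              /* column inside block */
                for (int r = 0; r < lb; ++r)          /* row inside block */
                    y[ib * lb + r] += blk[c * lb + r] * x[jb * lb + c];
        }
    }
}
```

With m = 2 and lb = 2, a block-diagonal matrix whose diagonal blocks are (1 2; 3 4) and (5 6; 7 8) would be passed, under the column-by-column assumption, as a = {1, 3, 2, 4, 5, 7, 6, 8}, ia = {1, 2, 3}, ja = {1, 2}.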
Interfaces

FORTRAN 77:

SUBROUTINE mkl_sbsrgemv(transa, m, lb, a, ia, ja, x, y)
CHARACTER*1 transa
INTEGER m, lb
INTEGER ia(*), ja(*)
REAL a(*), x(*), y(*)

SUBROUTINE mkl_dbsrgemv(transa, m, lb, a, ia, ja, x, y)
CHARACTER*1 transa
INTEGER m, lb
INTEGER ia(*), ja(*)
DOUBLE PRECISION a(*), x(*), y(*)

SUBROUTINE mkl_cbsrgemv(transa, m, lb, a, ia, ja, x, y)
CHARACTER*1 transa
INTEGER m, lb
INTEGER ia(*), ja(*)
COMPLEX a(*), x(*), y(*)

SUBROUTINE mkl_zbsrgemv(transa, m, lb, a, ia, ja, x, y)
CHARACTER*1 transa
INTEGER m, lb
INTEGER ia(*), ja(*)
DOUBLE COMPLEX a(*), x(*), y(*)

C:

void mkl_sbsrgemv(char *transa, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_dbsrgemv(char *transa, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cbsrgemv(char *transa, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zbsrgemv(char *transa, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?coogemv

Computes matrix-vector product of a sparse general matrix stored in the coordinate format with one-based indexing.
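As a rough illustration of what a coordinate-format matrix-vector product computes, the following plain-C sketch walks the (row, column, value) triples once. It is not the library's implementation; ref_dcoogemv is a hypothetical name, its argument names mirror the mkl_?coogemv calling sequence, and scalars are passed by value for readability.

```c
#include <assert.h>

/* Reference sketch (not the MKL implementation): y := A*x or y := A'*x
 * for an m-by-m matrix given as nnz (row, column, value) triples with
 * one-based indices. The triples may appear in any order. */
static void ref_dcoogemv(char transa, int m, const double *val,
                         const int *rowind, const int *colind, int nnz,
                         const double *x, double *y)
{
    assert(m >= 0 && nnz >= 0);
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;

    for (int k = 0; k < nnz; ++k) {
        int i = rowind[k] - 1;           /* one-based row -> C index */
        int j = colind[k] - 1;           /* one-based column -> C index */
        if (transa == 'N' || transa == 'n')
            y[i] += val[k] * x[j];       /* y := A*x  */
        else
            y[j] += val[k] * x[i];       /* y := A'*x */
    }
}
```

Because each triple carries its own row and column, the order of the entries does not affect the result.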
Syntax

Fortran:
call mkl_scoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_dcoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_ccoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_zcoogemv(transa, m, val, rowind, colind, nnz, x, y)

C:
mkl_scoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_dcoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_ccoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_zcoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?coogemv routine performs a matrix-vector operation defined as
y := A*x
or
y := A'*x,
where:
x and y are vectors,
A is an m-by-m sparse square matrix in the coordinate format,
A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation.
If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x.
If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x.

m INTEGER. Number of rows of the matrix A.

val REAL for mkl_scoogemv. DOUBLE PRECISION for mkl_dcoogemv. COMPLEX for mkl_ccoogemv. DOUBLE COMPLEX for mkl_zcoogemv.
Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.

rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details.

colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A.
Refer to columns array description in Coordinate Format for more details.

nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.

x REAL for mkl_scoogemv. DOUBLE PRECISION for mkl_dcoogemv. COMPLEX for mkl_ccoogemv. DOUBLE COMPLEX for mkl_zcoogemv.
Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters

y REAL for mkl_scoogemv. DOUBLE PRECISION for mkl_dcoogemv. COMPLEX for mkl_ccoogemv. DOUBLE COMPLEX for mkl_zcoogemv.
Array, DIMENSION at least m. On exit, the array y contains the vector y.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_scoogemv(transa, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 transa
INTEGER m, nnz
INTEGER rowind(*), colind(*)
REAL val(*), x(*), y(*)

SUBROUTINE mkl_dcoogemv(transa, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 transa
INTEGER m, nnz
INTEGER rowind(*), colind(*)
DOUBLE PRECISION val(*), x(*), y(*)

SUBROUTINE mkl_ccoogemv(transa, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 transa
INTEGER m, nnz
INTEGER rowind(*), colind(*)
COMPLEX val(*), x(*), y(*)

SUBROUTINE mkl_zcoogemv(transa, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 transa
INTEGER m, nnz
INTEGER rowind(*), colind(*)
DOUBLE COMPLEX val(*), x(*), y(*)

C:

void mkl_scoogemv(char *transa, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_dcoogemv(char *transa, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_ccoogemv(char *transa, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zcoogemv(char *transa, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?diagemv

Computes matrix-vector product of a sparse general matrix stored in the diagonal format with one-based indexing.
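A diagonal-format product can be sketched in plain C as one pass per stored diagonal. This is not the library's implementation: ref_ddiagemv_n is a hypothetical name, only the non-transposed case is covered, and the sketch assumes that column k of the two-dimensional values array holds the diagonal at distance idiag(k), with the element in row i of that column equal to A(i, i + idiag(k)) and unused positions padded. Check the Diagonal Storage Scheme description before relying on this layout.

```c
#include <assert.h>

/* Reference sketch (not the MKL implementation): y := A*x for an m-by-m
 * matrix stored by diagonals. val is treated as a column-major array with
 * leading dimension lval and ndiag columns.
 * Assumption: val(i, k) holds A(i, i + idiag(k)). */
static void ref_ddiagemv_n(int m, const double *val, int lval,
                           const int *idiag, int ndiag,
                           const double *x, double *y)
{
    assert(m >= 0 && lval >= m && ndiag >= 0);
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;

    for (int k = 0; k < ndiag; ++k) {
        int d = idiag[k];                /* distance from main diagonal */
        for (int i = 0; i < m; ++i) {
            int j = i + d;               /* column hit by this diagonal */
            if (j >= 0 && j < m)         /* skip padded positions */
                y[i] += val[k * lval + i] * x[j];
        }
    }
}
```

For the tridiagonal matrix with rows (2 1 0), (3 2 1), (0 3 2), this layout would give idiag = {-1, 0, 1} and val = {0, 3, 3, 2, 2, 2, 1, 1, 0} with lval = 3, where the zeros at the ends are padding.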
Syntax

Fortran:
call mkl_sdiagemv(transa, m, val, lval, idiag, ndiag, x, y)
call mkl_ddiagemv(transa, m, val, lval, idiag, ndiag, x, y)
call mkl_cdiagemv(transa, m, val, lval, idiag, ndiag, x, y)
call mkl_zdiagemv(transa, m, val, lval, idiag, ndiag, x, y)

C:
mkl_sdiagemv(&transa, &m, val, &lval, idiag, &ndiag, x, y);
mkl_ddiagemv(&transa, &m, val, &lval, idiag, &ndiag, x, y);
mkl_cdiagemv(&transa, &m, val, &lval, idiag, &ndiag, x, y);
mkl_zdiagemv(&transa, &m, val, &lval, idiag, &ndiag, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?diagemv routine performs a matrix-vector operation defined as
y := A*x
or
y := A'*x,
where:
x and y are vectors,
A is an m-by-m sparse square matrix in the diagonal storage format,
A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation.
If transa = 'N' or 'n', then y := A*x.
If transa = 'T' or 't' or 'C' or 'c', then y := A'*x.

m INTEGER. Number of rows of the matrix A.

val REAL for mkl_sdiagemv. DOUBLE PRECISION for mkl_ddiagemv. COMPLEX for mkl_cdiagemv. DOUBLE COMPLEX for mkl_zdiagemv.
Two-dimensional array of size lval*ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.

lval INTEGER. Leading dimension of val, lval ≥ m. Refer to lval description in Diagonal Storage Scheme for more details.

idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A. Refer to distance array description in Diagonal Storage Scheme for more details.

ndiag INTEGER.
Specifies the number of non-zero diagonals of the matrix A.

x REAL for mkl_sdiagemv. DOUBLE PRECISION for mkl_ddiagemv. COMPLEX for mkl_cdiagemv. DOUBLE COMPLEX for mkl_zdiagemv.
Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters

y REAL for mkl_sdiagemv. DOUBLE PRECISION for mkl_ddiagemv. COMPLEX for mkl_cdiagemv. DOUBLE COMPLEX for mkl_zdiagemv.
Array, DIMENSION at least m. On exit, the array y contains the vector y.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_sdiagemv(transa, m, val, lval, idiag, ndiag, x, y)
CHARACTER*1 transa
INTEGER m, lval, ndiag
INTEGER idiag(*)
REAL val(lval,*), x(*), y(*)

SUBROUTINE mkl_ddiagemv(transa, m, val, lval, idiag, ndiag, x, y)
CHARACTER*1 transa
INTEGER m, lval, ndiag
INTEGER idiag(*)
DOUBLE PRECISION val(lval,*), x(*), y(*)

SUBROUTINE mkl_cdiagemv(transa, m, val, lval, idiag, ndiag, x, y)
CHARACTER*1 transa
INTEGER m, lval, ndiag
INTEGER idiag(*)
COMPLEX val(lval,*), x(*), y(*)

SUBROUTINE mkl_zdiagemv(transa, m, val, lval, idiag, ndiag, x, y)
CHARACTER*1 transa
INTEGER m, lval, ndiag
INTEGER idiag(*)
DOUBLE COMPLEX val(lval,*), x(*), y(*)

C:

void mkl_sdiagemv(char *transa, int *m, float *val, int *lval, int *idiag, int *ndiag, float *x, float *y);
void mkl_ddiagemv(char *transa, int *m, double *val, int *lval, int *idiag, int *ndiag, double *x, double *y);
void mkl_cdiagemv(char *transa, int *m, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zdiagemv(char *transa, int *m, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?csrsymv

Computes matrix-vector product of a sparse symmetrical matrix stored in the CSR format (3-array variation) with one-based indexing.
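The idea behind a symmetric product from a single stored triangle can be sketched in plain C: every stored off-diagonal entry contributes twice, once for its own position and once for the mirrored position. This is not the library's implementation; ref_dcsrsymv is a hypothetical name, scalars are passed by value, and skipping entries that fall outside the selected triangle is an assumption of this sketch.

```c
#include <assert.h>

/* Reference sketch (not the MKL implementation): y := A*x for a symmetric
 * matrix of which only one triangle is used, in 3-array CSR storage with
 * one-based indices.
 * Assumption: stored entries outside the selected triangle are skipped. */
static void ref_dcsrsymv(char uplo, int m, const double *a,
                         const int *ia, const int *ja,
                         const double *x, double *y)
{
    int upper = (uplo == 'U' || uplo == 'u');
    assert(m >= 0);
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;

    for (int i = 0; i < m; ++i) {
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; ++k) {
            int j = ja[k] - 1;
            if (upper ? (j < i) : (j > i))
                continue;                /* not in the selected triangle */
            y[i] += a[k] * x[j];         /* stored entry A(i,j) */
            if (j != i)
                y[j] += a[k] * x[i];     /* mirrored entry A(j,i) */
        }
    }
}
```

For the symmetric matrix with rows (4 1 0), (1 5 2), (0 2 6), the upper triangle would be passed as a = {4, 1, 5, 2, 6}, ia = {1, 3, 5, 6}, ja = {1, 2, 2, 3, 3}, and the lower triangle as a = {4, 1, 5, 2, 6}, ia = {1, 2, 4, 6}, ja = {1, 1, 2, 2, 3}.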
Syntax

Fortran:
call mkl_scsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_dcsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_ccsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_zcsrsymv(uplo, m, a, ia, ja, x, y)

C:
mkl_scsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_dcsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_ccsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_zcsrsymv(&uplo, &m, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?csrsymv routine performs a matrix-vector operation defined as
y := A*x
where:
x and y are vectors,
A is an upper or lower triangle of the symmetrical sparse matrix in the CSR format (3-array variation).

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used.
If uplo = 'U' or 'u', then the upper triangle of the matrix A is used.
If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.

m INTEGER. Number of rows of the matrix A.

a REAL for mkl_scsrsymv. DOUBLE PRECISION for mkl_dcsrsymv. COMPLEX for mkl_ccsrsymv. DOUBLE COMPLEX for mkl_zcsrsymv.
Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.

ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.

ja INTEGER.
Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.

x REAL for mkl_scsrsymv. DOUBLE PRECISION for mkl_dcsrsymv. COMPLEX for mkl_ccsrsymv. DOUBLE COMPLEX for mkl_zcsrsymv.
Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters

y REAL for mkl_scsrsymv. DOUBLE PRECISION for mkl_dcsrsymv. COMPLEX for mkl_ccsrsymv. DOUBLE COMPLEX for mkl_zcsrsymv.
Array, DIMENSION at least m. On exit, the array y contains the vector y.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_scsrsymv(uplo, m, a, ia, ja, x, y)
CHARACTER*1 uplo
INTEGER m
INTEGER ia(*), ja(*)
REAL a(*), x(*), y(*)

SUBROUTINE mkl_dcsrsymv(uplo, m, a, ia, ja, x, y)
CHARACTER*1 uplo
INTEGER m
INTEGER ia(*), ja(*)
DOUBLE PRECISION a(*), x(*), y(*)

SUBROUTINE mkl_ccsrsymv(uplo, m, a, ia, ja, x, y)
CHARACTER*1 uplo
INTEGER m
INTEGER ia(*), ja(*)
COMPLEX a(*), x(*), y(*)

SUBROUTINE mkl_zcsrsymv(uplo, m, a, ia, ja, x, y)
CHARACTER*1 uplo
INTEGER m
INTEGER ia(*), ja(*)
DOUBLE COMPLEX a(*), x(*), y(*)

C:

void mkl_scsrsymv(char *uplo, int *m, float *a, int *ia, int *ja, float *x, float *y);
void mkl_dcsrsymv(char *uplo, int *m, double *a, int *ia, int *ja, double *x, double *y);
void mkl_ccsrsymv(char *uplo, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zcsrsymv(char *uplo, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?bsrsymv

Computes matrix-vector product of a sparse symmetrical matrix stored in the BSR format (3-array variation) with one-based indexing.
Syntax

Fortran:
call mkl_sbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_dbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_cbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_zbsrsymv(uplo, m, lb, a, ia, ja, x, y)

C:
mkl_sbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_dbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_cbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_zbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?bsrsymv routine performs a matrix-vector operation defined as
y := A*x
where:
x and y are vectors,
A is an upper or lower triangle of the symmetrical sparse matrix in the BSR format (3-array variation).

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used.
If uplo = 'U' or 'u', then the upper triangle of the matrix A is used.
If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.

m INTEGER. Number of block rows of the matrix A.

lb INTEGER. Size of the block in the matrix A.

a REAL for mkl_sbsrsymv. DOUBLE PRECISION for mkl_dbsrsymv. COMPLEX for mkl_cbsrsymv. DOUBLE COMPLEX for mkl_zbsrsymv.
Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.

ia INTEGER. Array of length (m + 1), containing indices of blocks in the array a, such that ia(i) is the index in the array a of the first non-zero block from the block row i.
The value of the last element ia(m + 1) is equal to the number of non-zero blocks plus one. Refer to rowIndex array description in BSR Format for more details.

ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.

x REAL for mkl_sbsrsymv. DOUBLE PRECISION for mkl_dbsrsymv. COMPLEX for mkl_cbsrsymv. DOUBLE COMPLEX for mkl_zbsrsymv.
Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.

Output Parameters

y REAL for mkl_sbsrsymv. DOUBLE PRECISION for mkl_dbsrsymv. COMPLEX for mkl_cbsrsymv. DOUBLE COMPLEX for mkl_zbsrsymv.
Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_sbsrsymv(uplo, m, lb, a, ia, ja, x, y)
CHARACTER*1 uplo
INTEGER m, lb
INTEGER ia(*), ja(*)
REAL a(*), x(*), y(*)

SUBROUTINE mkl_dbsrsymv(uplo, m, lb, a, ia, ja, x, y)
CHARACTER*1 uplo
INTEGER m, lb
INTEGER ia(*), ja(*)
DOUBLE PRECISION a(*), x(*), y(*)

SUBROUTINE mkl_cbsrsymv(uplo, m, lb, a, ia, ja, x, y)
CHARACTER*1 uplo
INTEGER m, lb
INTEGER ia(*), ja(*)
COMPLEX a(*), x(*), y(*)

SUBROUTINE mkl_zbsrsymv(uplo, m, lb, a, ia, ja, x, y)
CHARACTER*1 uplo
INTEGER m, lb
INTEGER ia(*), ja(*)
DOUBLE COMPLEX a(*), x(*), y(*)

C:

void mkl_sbsrsymv(char *uplo, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_dbsrsymv(char *uplo, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cbsrsymv(char *uplo, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zbsrsymv(char *uplo, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?coosymv

Computes matrix-vector product of a sparse symmetrical matrix stored in the coordinate format with one-based indexing.
Syntax

Fortran:
call mkl_scoosymv(uplo, m, val, rowind, colind, nnz, x, y)
call mkl_dcoosymv(uplo, m, val, rowind, colind, nnz, x, y)
call mkl_ccoosymv(uplo, m, val, rowind, colind, nnz, x, y)
call mkl_zcoosymv(uplo, m, val, rowind, colind, nnz, x, y)

C:
mkl_scoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y);
mkl_dcoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y);
mkl_ccoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y);
mkl_zcoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?coosymv routine performs a matrix-vector operation defined as
y := A*x
where:
x and y are vectors,
A is an upper or lower triangle of the symmetrical sparse matrix in the coordinate format.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used.
If uplo = 'U' or 'u', then the upper triangle of the matrix A is used.
If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.

m INTEGER. Number of rows of the matrix A.

val REAL for mkl_scoosymv. DOUBLE PRECISION for mkl_dcoosymv. COMPLEX for mkl_ccoosymv. DOUBLE COMPLEX for mkl_zcoosymv.
Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.

rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details.

colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A.
Refer to columns array description in Coordinate Format for more details.

nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.

x REAL for mkl_scoosymv. DOUBLE PRECISION for mkl_dcoosymv. COMPLEX for mkl_ccoosymv. DOUBLE COMPLEX for mkl_zcoosymv.
Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters

y REAL for mkl_scoosymv. DOUBLE PRECISION for mkl_dcoosymv. COMPLEX for mkl_ccoosymv. DOUBLE COMPLEX for mkl_zcoosymv.
Array, DIMENSION at least m. On exit, the array y contains the vector y.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_scoosymv(uplo, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 uplo
INTEGER m, nnz
INTEGER rowind(*), colind(*)
REAL val(*), x(*), y(*)

SUBROUTINE mkl_dcoosymv(uplo, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 uplo
INTEGER m, nnz
INTEGER rowind(*), colind(*)
DOUBLE PRECISION val(*), x(*), y(*)

SUBROUTINE mkl_ccoosymv(uplo, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 uplo
INTEGER m, nnz
INTEGER rowind(*), colind(*)
COMPLEX val(*), x(*), y(*)

SUBROUTINE mkl_zcoosymv(uplo, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 uplo
INTEGER m, nnz
INTEGER rowind(*), colind(*)
DOUBLE COMPLEX val(*), x(*), y(*)

C:

void mkl_scoosymv(char *uplo, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_dcoosymv(char *uplo, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_ccoosymv(char *uplo, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zcoosymv(char *uplo, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?diasymv

Computes matrix-vector product of a sparse symmetrical matrix stored in the diagonal format with one-based indexing.
Syntax

Fortran:
call mkl_sdiasymv(uplo, m, val, lval, idiag, ndiag, x, y)
call mkl_ddiasymv(uplo, m, val, lval, idiag, ndiag, x, y)
call mkl_cdiasymv(uplo, m, val, lval, idiag, ndiag, x, y)
call mkl_zdiasymv(uplo, m, val, lval, idiag, ndiag, x, y)

C:
mkl_sdiasymv(&uplo, &m, val, &lval, idiag, &ndiag, x, y);
mkl_ddiasymv(&uplo, &m, val, &lval, idiag, &ndiag, x, y);
mkl_cdiasymv(&uplo, &m, val, &lval, idiag, &ndiag, x, y);
mkl_zdiasymv(&uplo, &m, val, &lval, idiag, &ndiag, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?diasymv routine performs a matrix-vector operation defined as
y := A*x
where:
x and y are vectors,
A is an upper or lower triangle of the symmetrical sparse matrix in the diagonal storage format.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used.
If uplo = 'U' or 'u', then the upper triangle of the matrix A is used.
If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.

m INTEGER. Number of rows of the matrix A.

val REAL for mkl_sdiasymv. DOUBLE PRECISION for mkl_ddiasymv. COMPLEX for mkl_cdiasymv. DOUBLE COMPLEX for mkl_zdiasymv.
Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.

lval INTEGER. Leading dimension of val, lval ≥ m. Refer to lval description in Diagonal Storage Scheme for more details.

idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A.
Refer to distance array description in Diagonal Storage Scheme for more details.

ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.

x REAL for mkl_sdiasymv. DOUBLE PRECISION for mkl_ddiasymv. COMPLEX for mkl_cdiasymv. DOUBLE COMPLEX for mkl_zdiasymv.
Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters

y REAL for mkl_sdiasymv. DOUBLE PRECISION for mkl_ddiasymv. COMPLEX for mkl_cdiasymv. DOUBLE COMPLEX for mkl_zdiasymv.
Array, DIMENSION at least m. On exit, the array y contains the vector y.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_sdiasymv(uplo, m, val, lval, idiag, ndiag, x, y)
CHARACTER*1 uplo
INTEGER m, lval, ndiag
INTEGER idiag(*)
REAL val(lval,*), x(*), y(*)

SUBROUTINE mkl_ddiasymv(uplo, m, val, lval, idiag, ndiag, x, y)
CHARACTER*1 uplo
INTEGER m, lval, ndiag
INTEGER idiag(*)
DOUBLE PRECISION val(lval,*), x(*), y(*)

SUBROUTINE mkl_cdiasymv(uplo, m, val, lval, idiag, ndiag, x, y)
CHARACTER*1 uplo
INTEGER m, lval, ndiag
INTEGER idiag(*)
COMPLEX val(lval,*), x(*), y(*)

SUBROUTINE mkl_zdiasymv(uplo, m, val, lval, idiag, ndiag, x, y)
CHARACTER*1 uplo
INTEGER m, lval, ndiag
INTEGER idiag(*)
DOUBLE COMPLEX val(lval,*), x(*), y(*)

C:

void mkl_sdiasymv(char *uplo, int *m, float *val, int *lval, int *idiag, int *ndiag, float *x, float *y);
void mkl_ddiasymv(char *uplo, int *m, double *val, int *lval, int *idiag, int *ndiag, double *x, double *y);
void mkl_cdiasymv(char *uplo, int *m, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zdiasymv(char *uplo, int *m, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?csrtrsv

Triangular solvers with simplified interface for a sparse matrix in the CSR format (3-array variation) with one-based indexing.
Syntax

Fortran:
call mkl_scsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_dcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_ccsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_zcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)

C:
mkl_scsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_dcsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_ccsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_zcsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?csrtrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the CSR format (3-array variation):
A*y = x
or
A'*y = x,
where:
x and y are vectors,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal,
A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used.
If uplo = 'U' or 'u', then the upper triangle of the matrix A is used.
If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.

transa CHARACTER*1. Specifies the system of linear equations.
If transa = 'N' or 'n', then A*y = x.
If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.

diag CHARACTER*1. Specifies whether A is unit triangular.
If diag = 'U' or 'u', then A is unit triangular.
If diag = 'N' or 'n', then A is not unit triangular.

m INTEGER. Number of rows of the matrix A.

a REAL for mkl_scsrtrsv. DOUBLE PRECISION for mkl_dcsrtrsv. COMPLEX for mkl_ccsrtrsv. DOUBLE COMPLEX for mkl_zcsrtrsv.
Array containing non-zero elements of the matrix A.
Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.

NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from the sparse storage if the solver is called with the non-unit indicator.

ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.

ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.

NOTE Column indices must be sorted in increasing order for each row.

x REAL for mkl_scsrtrsv. DOUBLE PRECISION for mkl_dcsrtrsv. COMPLEX for mkl_ccsrtrsv. DOUBLE COMPLEX for mkl_zcsrtrsv.
Array, DIMENSION is m. On entry, the array x must contain the vector x.

Output Parameters

y REAL for mkl_scsrtrsv. DOUBLE PRECISION for mkl_dcsrtrsv. COMPLEX for mkl_ccsrtrsv. DOUBLE COMPLEX for mkl_zcsrtrsv.
Array, DIMENSION at least m. Contains the vector y.
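The solve A*y = x described above amounts to forward substitution for a lower triangle and back substitution for an upper triangle. The following plain-C sketch covers only the non-transposed case and is not the library's implementation; ref_dcsrtrsv_n is a hypothetical name, scalars are passed by value, and ignoring stored entries outside the selected triangle is an assumption of this sketch.

```c
#include <assert.h>

/* Reference sketch (not the MKL implementation): solves A*y = x for a
 * triangular matrix in 3-array CSR storage with one-based indices.
 * Only the non-transposed system is handled.
 * Assumption: entries outside the selected triangle are ignored. */
static void ref_dcsrtrsv_n(char uplo, char diag, int m, const double *a,
                           const int *ia, const int *ja,
                           const double *x, double *y)
{
    int upper = (uplo == 'U' || uplo == 'u');
    int unit  = (diag == 'U' || diag == 'u');
    assert(m >= 0);

    for (int s = 0; s < m; ++s) {
        /* Lower: rows top to bottom. Upper: rows bottom to top. */
        int i = upper ? m - 1 - s : s;
        double sum = x[i];
        double d = 1.0;                  /* diagonal, 1 if unit triangular */
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; ++k) {
            int j = ja[k] - 1;
            if (j == i) {
                if (!unit)
                    d = a[k];            /* stored diagonal element */
            } else if (upper ? (j > i) : (j < i)) {
                sum -= a[k] * y[j];      /* y[j] is already known */
            }
        }
        y[i] = sum / d;
    }
}
```

With the unit indicator, any stored diagonal value is treated as 1 in this sketch, which is why the diagonal may only be omitted from storage in that case.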
Interfaces

FORTRAN 77:

SUBROUTINE mkl_scsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
CHARACTER*1 uplo, transa, diag
INTEGER m
INTEGER ia(*), ja(*)
REAL a(*), x(*), y(*)

SUBROUTINE mkl_dcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
CHARACTER*1 uplo, transa, diag
INTEGER m
INTEGER ia(*), ja(*)
DOUBLE PRECISION a(*), x(*), y(*)

SUBROUTINE mkl_ccsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
CHARACTER*1 uplo, transa, diag
INTEGER m
INTEGER ia(*), ja(*)
COMPLEX a(*), x(*), y(*)

SUBROUTINE mkl_zcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
CHARACTER*1 uplo, transa, diag
INTEGER m
INTEGER ia(*), ja(*)
DOUBLE COMPLEX a(*), x(*), y(*)

C:

void mkl_scsrtrsv(char *uplo, char *transa, char *diag, int *m, float *a, int *ia, int *ja, float *x, float *y);
void mkl_dcsrtrsv(char *uplo, char *transa, char *diag, int *m, double *a, int *ia, int *ja, double *x, double *y);
void mkl_ccsrtrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zcsrtrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?bsrtrsv

Triangular solver with simplified interface for a sparse matrix stored in the BSR format (3-array variation) with one-based indexing.
Syntax

Fortran:
call mkl_sbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_dbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_cbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_zbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)

C:
mkl_sbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_dbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_cbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_zbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?bsrtrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the BSR format (3-array variation):
A*y = x
or
A'*y = x,
where:
x and y are vectors,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal,
A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used.
If uplo = 'U' or 'u', then the upper triangle of the matrix A is used.
If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.

transa CHARACTER*1. Specifies the system of linear equations.
If transa = 'N' or 'n', then A*y = x.
If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.

diag CHARACTER*1. Specifies whether A is a unit triangular matrix.
If diag = 'U' or 'u', then A is unit triangular.
If diag = 'N' or 'n', then A is not unit triangular.

m INTEGER. Number of block rows of the matrix A.

lb INTEGER.
Size of the block in the matrix A. a REAL for mkl_sbsrtrsv. DOUBLE PRECISION for mkl_dbsrtrsv. COMPLEX for mkl_cbsrtrsv. DOUBLE COMPLEX for mkl_zbsrtrsv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. ia INTEGER. Array of length (m + 1), containing indices of block in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zero blocks plus one. Refer to rowIndex array description in BSR Format for more details. ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details. x REAL for mkl_sbsrtrsv. DOUBLE PRECISION for mkl_dbsrtrsv. COMPLEX for mkl_cbsrtrsv. DOUBLE COMPLEX for mkl_zbsrtrsv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_sbsrtrsv. DOUBLE PRECISION for mkl_dbsrtrsv. COMPLEX for mkl_cbsrtrsv. DOUBLE COMPLEX for mkl_zbsrtrsv. Array, DIMENSION at least (m*lb). On exit, the array y must contain the vector y. 
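The array-length rules above can be checked before calling the solver. The following is a minimal sketch in plain C, not part of Intel MKL: the helper names bsr_one_based_ok and bsr_values_length, and the extra parameter nnzb (the number of non-zero blocks), are hypothetical and introduced only for illustration. The check that ia(1) equals 1 is an assumption that follows from one-based indexing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): returns 1 when ia and ja are
   consistent with the one-based BSR description above, 0 otherwise.
   m    - number of block rows
   nnzb - number of non-zero blocks (illustrative parameter) */
int bsr_one_based_ok(int m, int nnzb, const int *ia, const int *ja)
{
    assert(ia != NULL && ja != NULL);
    if (m < 0 || nnzb < 0) return 0;
    if (ia[0] != 1) return 0;            /* first block row starts at index 1 */
    if (ia[m] != nnzb + 1) return 0;     /* ia(m + 1) = non-zero blocks + 1   */
    for (int i = 0; i < m; ++i)
        if (ia[i] > ia[i + 1]) return 0; /* row pointers must not decrease    */
    for (int k = 0; k < nnzb; ++k)
        if (ja[k] < 1 || ja[k] > m) return 0; /* one-based block columns      */
    return 1;
}

/* Required length of the values array a: nnzb blocks of lb*lb elements. */
size_t bsr_values_length(int nnzb, int lb)
{
    return (size_t)nnzb * (size_t)lb * (size_t)lb;
}
```

For a matrix with two block rows and three non-zero blocks, ia = (1, 2, 4) and ja = (1, 1, 2) pass the check, and with lb = 2 the array a holds 12 elements.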
Interfaces
FORTRAN 77:
SUBROUTINE mkl_sbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  REAL a(*), x(*), y(*)
SUBROUTINE mkl_dbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_zbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_sbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_dbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?cootrsv
Triangular solvers with simplified interface for a sparse matrix in the coordinate format with one-based indexing.
Syntax
Fortran:
call mkl_scootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_dcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_ccootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
call mkl_zcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
C:
mkl_scootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_dcootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_ccootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
mkl_zcootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_?cootrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the coordinate format:
A*y = x or A'*y = x,
where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports only one-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is considered. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1. Specifies whether A is unit triangular. If diag = 'U' or 'u', then A is unit triangular. If diag = 'N' or 'n', then A is not unit triangular.
m INTEGER. Number of rows of the matrix A.
val REAL for mkl_scootrsv. DOUBLE PRECISION for mkl_dcootrsv. COMPLEX for mkl_ccootrsv. DOUBLE COMPLEX for mkl_zcootrsv. Array of length nnz, contains the non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.
rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details.
colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details.
nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.
x REAL for mkl_scootrsv. DOUBLE PRECISION for mkl_dcootrsv. COMPLEX for mkl_ccootrsv. DOUBLE COMPLEX for mkl_zcootrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x.
Output Parameters
y REAL for mkl_scootrsv. DOUBLE PRECISION for mkl_dcootrsv. COMPLEX for mkl_ccootrsv. DOUBLE COMPLEX for mkl_zcootrsv. Array, DIMENSION at least m. Contains the vector y.
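To make the meaning of A*y = x concrete for the coordinate format, the sketch below solves a lower triangular, non-unit system by forward substitution in plain C. It is only a reference illustration of the semantics described above (uplo = 'L', transa = 'N', diag = 'N', one-based rowind and colind); the function name coo_lower_solve and its return code are hypothetical, and nothing here is meant to reflect how Intel MKL implements the routine.

```c
#include <assert.h>

/* Hypothetical reference helper (not an MKL routine).
   Solves A*y = x for a lower triangular A given in one-based coordinate
   format with entries in arbitrary order. Entries above the diagonal are
   ignored. Returns 0 on success, -1 if a diagonal entry is missing or zero. */
int coo_lower_solve(int m, const double *val, const int *rowind,
                    const int *colind, int nnz, const double *x, double *y)
{
    assert(m >= 0 && nnz >= 0);
    for (int i = 1; i <= m; ++i) {          /* one-based row index */
        double sum = x[i - 1];
        double d = 0.0;
        for (int k = 0; k < nnz; ++k) {     /* scan: entries are unordered */
            if (rowind[k] != i) continue;
            if (colind[k] == i)
                d += val[k];                /* diagonal element A(i,i) */
            else if (colind[k] < i)
                sum -= val[k] * y[colind[k] - 1];
        }
        if (d == 0.0) return -1;            /* non-unit solve needs A(i,i) */
        y[i - 1] = sum / d;
    }
    return 0;
}
```

The repeated scan over all nnz entries keeps the sketch short and independent of entry order; it costs O(m*nnz) and is not intended for production use.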
Interfaces
FORTRAN 77:
SUBROUTINE mkl_scootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  REAL val(*), x(*), y(*)
SUBROUTINE mkl_dcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_ccootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_zcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_scootrsv(char *uplo, char *transa, char *diag, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_dcootrsv(char *uplo, char *transa, char *diag, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_ccootrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zcootrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?diatrsv
Triangular solvers with simplified interface for a sparse matrix in the diagonal format with one-based indexing.
Syntax
Fortran:
call mkl_sdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
call mkl_ddiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
call mkl_cdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
call mkl_zdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
C:
mkl_sdiatrsv(&uplo, &transa, &diag, &m, val, &lval, idiag, &ndiag, x, y);
mkl_ddiatrsv(&uplo, &transa, &diag, &m, val, &lval, idiag, &ndiag, x, y);
mkl_cdiatrsv(&uplo, &transa, &diag, &m, val, &lval, idiag, &ndiag, x, y);
mkl_zdiatrsv(&uplo, &transa, &diag, &m, val, &lval, idiag, &ndiag, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_?diatrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the diagonal format:
A*y = x or A'*y = x,
where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports only one-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1. Specifies whether A is unit triangular. If diag = 'U' or 'u', then A is unit triangular. If diag = 'N' or 'n', then A is not unit triangular.
m INTEGER. Number of rows of the matrix A.
val REAL for mkl_sdiatrsv. DOUBLE PRECISION for mkl_ddiatrsv.
COMPLEX for mkl_cdiatrsv. DOUBLE COMPLEX for mkl_zdiatrsv. Two-dimensional array of size lval by ndiag, contains the non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.
lval INTEGER. Leading dimension of val, lval=m. Refer to lval description in Diagonal Storage Scheme for more details.
idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A.
NOTE All elements of this array must be sorted in increasing order.
Refer to distance array description in Diagonal Storage Scheme for more details.
ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.
x REAL for mkl_sdiatrsv. DOUBLE PRECISION for mkl_ddiatrsv. COMPLEX for mkl_cdiatrsv. DOUBLE COMPLEX for mkl_zdiatrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x.
Output Parameters
y REAL for mkl_sdiatrsv. DOUBLE PRECISION for mkl_ddiatrsv. COMPLEX for mkl_cdiatrsv. DOUBLE COMPLEX for mkl_zdiatrsv. Array, DIMENSION at least m. Contains the vector y.
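The roles of val, lval, idiag, and ndiag can be illustrated with a small forward-substitution sketch in plain C. It rests on one labeled assumption about the Diagonal Storage Scheme referenced above: that val is stored column by column (Fortran order) and that val(i, d) holds the element A(i, i + idiag(d)), with unused positions padded. The function name dia_lower_solve is hypothetical, the sketch covers only uplo = 'L', transa = 'N', diag = 'N', and it is not Intel MKL's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper (not an MKL routine).
   Solves A*y = x for a lower triangular A in diagonal storage.
   ASSUMPTION: val is column-major with leading dimension lval, and
   column d holds the diagonal at distance idiag[d], indexed by row.
   Returns 0 on success, -1 if the main diagonal is missing or has a zero. */
int dia_lower_solve(int m, const double *val, int lval, const int *idiag,
                    int ndiag, const double *x, double *y)
{
    assert(m >= 0 && ndiag >= 0 && lval >= m);
    for (int i = 0; i < m; ++i) {               /* zero-based row in C */
        double sum = x[i];
        double d = 0.0;
        for (int k = 0; k < ndiag; ++k) {
            const int j = i + idiag[k];         /* column of this element */
            const double v = val[(size_t)k * (size_t)lval + (size_t)i];
            if (idiag[k] == 0)
                d = v;                          /* main diagonal A(i,i) */
            else if (idiag[k] < 0 && j >= 0)
                sum -= v * y[j];                /* sub-diagonal term */
            /* super-diagonals (idiag[k] > 0) are ignored for uplo = 'L' */
        }
        if (d == 0.0) return -1;
        y[i] = sum / d;
    }
    return 0;
}
```

For the 3-by-3 matrix with main diagonal (2, 4, 5) and sub-diagonal (1, 3), idiag = (-1, 0) and val holds the columns (pad, 1, 3) and (2, 4, 5) under the assumed layout.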
Interfaces
FORTRAN 77:
SUBROUTINE mkl_sdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  REAL val(lval,*), x(*), y(*)
SUBROUTINE mkl_ddiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  DOUBLE PRECISION val(lval,*), x(*), y(*)
SUBROUTINE mkl_cdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  COMPLEX val(lval,*), x(*), y(*)
SUBROUTINE mkl_zdiatrsv(uplo, transa, diag, m, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 uplo, transa, diag
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  DOUBLE COMPLEX val(lval,*), x(*), y(*)
C:
void mkl_sdiatrsv(char *uplo, char *transa, char *diag, int *m, float *val, int *lval, int *idiag, int *ndiag, float *x, float *y);
void mkl_ddiatrsv(char *uplo, char *transa, char *diag, int *m, double *val, int *lval, int *idiag, int *ndiag, double *x, double *y);
void mkl_cdiatrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zdiatrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?csrgemv
Computes matrix-vector product of a sparse general matrix stored in the CSR format (3-array variation) with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_scsrgemv(transa, m, a, ia, ja, x, y)
call mkl_cspblas_dcsrgemv(transa, m, a, ia, ja, x, y)
call mkl_cspblas_ccsrgemv(transa, m, a, ia, ja, x, y)
call mkl_cspblas_zcsrgemv(transa, m, a, ia, ja, x, y)
C:
mkl_cspblas_scsrgemv(&transa, &m, a, ia, ja, x, y);
mkl_cspblas_dcsrgemv(&transa, &m, a, ia, ja, x, y);
mkl_cspblas_ccsrgemv(&transa, &m, a, ia, ja, x, y);
mkl_cspblas_zcsrgemv(&transa, &m, a, ia, ja, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_cspblas_?csrgemv routine performs a matrix-vector operation defined as
y := A*x or y := A'*x,
where: x and y are vectors, A is an m-by-m sparse square matrix in the CSR format (3-array variation) with zero-based indexing, A' is the transpose of A.
NOTE This routine supports only zero-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x.
m INTEGER. Number of rows of the matrix A.
a REAL for mkl_cspblas_scsrgemv. DOUBLE PRECISION for mkl_cspblas_dcsrgemv. COMPLEX for mkl_cspblas_ccsrgemv. DOUBLE COMPLEX for mkl_cspblas_zcsrgemv. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I.
The value of the last element ia(m) is equal to the number of non-zeros. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
x REAL for mkl_cspblas_scsrgemv. DOUBLE PRECISION for mkl_cspblas_dcsrgemv. COMPLEX for mkl_cspblas_ccsrgemv. DOUBLE COMPLEX for mkl_cspblas_zcsrgemv. Array, DIMENSION is m. On entry, the array x must contain the vector x.
Output Parameters
y REAL for mkl_cspblas_scsrgemv. DOUBLE PRECISION for mkl_cspblas_dcsrgemv. COMPLEX for mkl_cspblas_ccsrgemv. DOUBLE COMPLEX for mkl_cspblas_zcsrgemv. Array, DIMENSION at least m. On exit, the array y must contain the vector y.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scsrgemv(transa, m, a, ia, ja, x, y)
  CHARACTER*1 transa
  INTEGER m
  INTEGER ia(*), ja(*)
  REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcsrgemv(transa, m, a, ia, ja, x, y)
  CHARACTER*1 transa
  INTEGER m
  INTEGER ia(*), ja(*)
  DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccsrgemv(transa, m, a, ia, ja, x, y)
  CHARACTER*1 transa
  INTEGER m
  INTEGER ia(*), ja(*)
  COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcsrgemv(transa, m, a, ia, ja, x, y)
  CHARACTER*1 transa
  INTEGER m
  INTEGER ia(*), ja(*)
  DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_scsrgemv(char *transa, int *m, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dcsrgemv(char *transa, int *m, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_ccsrgemv(char *transa, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcsrgemv(char *transa, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?bsrgemv
Computes matrix-vector product of a sparse general matrix stored in
the BSR format (3-array variation) with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_sbsrgemv(transa, m, lb, a, ia, ja, x, y)
call mkl_cspblas_dbsrgemv(transa, m, lb, a, ia, ja, x, y)
call mkl_cspblas_cbsrgemv(transa, m, lb, a, ia, ja, x, y)
call mkl_cspblas_zbsrgemv(transa, m, lb, a, ia, ja, x, y)
C:
mkl_cspblas_sbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_dbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_cbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_zbsrgemv(&transa, &m, &lb, a, ia, ja, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_cspblas_?bsrgemv routine performs a matrix-vector operation defined as
y := A*x or y := A'*x,
where: x and y are vectors, A is an m-by-m block sparse square matrix in the BSR format (3-array variation) with zero-based indexing, A' is the transpose of A.
NOTE This routine supports only zero-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x.
m INTEGER. Number of block rows of the matrix A.
lb INTEGER. Size of the block in the matrix A.
a REAL for mkl_cspblas_sbsrgemv. DOUBLE PRECISION for mkl_cspblas_dbsrgemv. COMPLEX for mkl_cspblas_cbsrgemv. DOUBLE COMPLEX for mkl_cspblas_zbsrgemv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
ia INTEGER.
Array of length (m + 1), containing indices of block in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zero blocks. Refer to rowIndex array description in BSR Format for more details.
ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.
x REAL for mkl_cspblas_sbsrgemv. DOUBLE PRECISION for mkl_cspblas_dbsrgemv. COMPLEX for mkl_cspblas_cbsrgemv. DOUBLE COMPLEX for mkl_cspblas_zbsrgemv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.
Output Parameters
y REAL for mkl_cspblas_sbsrgemv. DOUBLE PRECISION for mkl_cspblas_dbsrgemv. COMPLEX for mkl_cspblas_cbsrgemv. DOUBLE COMPLEX for mkl_cspblas_zbsrgemv. Array, DIMENSION at least (m*lb). On exit, the array y must contain the vector y.
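The operation y := A*x on zero-based BSR arrays can be sketched in plain C as follows. This is an illustration of the data layout, not Intel MKL's implementation. The function name bsr0_gemv is hypothetical, and the sketch assumes that the lb*lb elements of each block are stored row by row inside a; consult the BSR Format description for the layout the library actually expects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper (not an MKL routine).
   Computes y := A*x for a zero-based BSR matrix with m block rows and
   square blocks of size lb. ASSUMPTION: each block is stored row-major. */
void bsr0_gemv(int m, int lb, const double *a, const int *ia, const int *ja,
               const double *x, double *y)
{
    assert(m >= 0 && lb > 0);
    for (int i = 0; i < m * lb; ++i)
        y[i] = 0.0;
    for (int bi = 0; bi < m; ++bi) {                 /* block row */
        for (int k = ia[bi]; k < ia[bi + 1]; ++k) {  /* blocks in this row */
            const double *blk = a + (size_t)k * (size_t)lb * (size_t)lb;
            const int bj = ja[k];                    /* block column */
            for (int r = 0; r < lb; ++r)
                for (int c = 0; c < lb; ++c)
                    y[bi * lb + r] += blk[r * lb + c] * x[bj * lb + c];
        }
    }
}
```

Both x and y have m*lb elements, which matches the DIMENSION (m*lb) stated for these arguments.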
Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_sbsrgemv(transa, m, lb, a, ia, ja, x, y)
  CHARACTER*1 transa
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dbsrgemv(transa, m, lb, a, ia, ja, x, y)
  CHARACTER*1 transa
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_cbsrgemv(transa, m, lb, a, ia, ja, x, y)
  CHARACTER*1 transa
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zbsrgemv(transa, m, lb, a, ia, ja, x, y)
  CHARACTER*1 transa
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_sbsrgemv(char *transa, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dbsrgemv(char *transa, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_cbsrgemv(char *transa, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zbsrgemv(char *transa, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?coogemv
Computes matrix-vector product of a sparse general matrix stored in the coordinate format with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_scoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_dcoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_ccoogemv(transa, m, val, rowind, colind, nnz, x, y)
call mkl_cspblas_zcoogemv(transa, m, val, rowind, colind, nnz, x, y)
C:
mkl_cspblas_scoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_dcoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_ccoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
mkl_cspblas_zcoogemv(&transa, &m, val, rowind, colind, &nnz, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_cspblas_?coogemv routine performs a matrix-vector operation defined as
y := A*x or y := A'*x,
where: x and y are vectors, A is an m-by-m sparse square matrix in the coordinate format with zero-based indexing, A' is the transpose of A.
NOTE This routine supports only zero-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := A*x. If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := A'*x.
m INTEGER. Number of rows of the matrix A.
val REAL for mkl_cspblas_scoogemv. DOUBLE PRECISION for mkl_cspblas_dcoogemv. COMPLEX for mkl_cspblas_ccoogemv. DOUBLE COMPLEX for mkl_cspblas_zcoogemv. Array of length nnz, contains the non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.
rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A.
Refer to rows array description in Coordinate Format for more details.
colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details.
nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.
x REAL for mkl_cspblas_scoogemv. DOUBLE PRECISION for mkl_cspblas_dcoogemv. COMPLEX for mkl_cspblas_ccoogemv. DOUBLE COMPLEX for mkl_cspblas_zcoogemv. Array, DIMENSION is m. On entry, the array x must contain the vector x.
Output Parameters
y REAL for mkl_cspblas_scoogemv. DOUBLE PRECISION for mkl_cspblas_dcoogemv. COMPLEX for mkl_cspblas_ccoogemv. DOUBLE COMPLEX for mkl_cspblas_zcoogemv. Array, DIMENSION at least m. On exit, the array y must contain the vector y.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scoogemv(transa, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 transa
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  REAL val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcoogemv(transa, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 transa
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccoogemv(transa, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 transa
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcoogemv(transa, m, val, rowind, colind, nnz, x, y)
  CHARACTER*1 transa
  INTEGER m, nnz
  INTEGER rowind(*), colind(*)
  DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_cspblas_scoogemv(char *transa, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_cspblas_dcoogemv(char *transa, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_cspblas_ccoogemv(char *transa, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcoogemv(char *transa, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?csrsymv
Computes matrix-vector product of a sparse symmetrical matrix stored in the CSR format (3-array variation) with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_scsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_cspblas_dcsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_cspblas_ccsrsymv(uplo, m, a, ia, ja, x, y)
call mkl_cspblas_zcsrsymv(uplo, m, a, ia, ja, x, y)
C:
mkl_cspblas_scsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_cspblas_dcsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_cspblas_ccsrsymv(&uplo, &m, a, ia, ja, x, y);
mkl_cspblas_zcsrsymv(&uplo, &m, a, ia, ja, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_cspblas_?csrsymv routine performs a matrix-vector operation defined as
y := A*x
where: x and y are vectors, A is an upper or lower triangle of the symmetrical sparse matrix in the CSR format (3-array variation) with zero-based indexing.
NOTE This routine supports only zero-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
m INTEGER. Number of rows of the matrix A.
a REAL for mkl_cspblas_scsrsymv. DOUBLE PRECISION for mkl_cspblas_dcsrsymv. COMPLEX for mkl_cspblas_ccsrsymv. DOUBLE COMPLEX for mkl_cspblas_zcsrsymv. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A.
Refer to values array description in Sparse Matrix Storage Formats for more details.
ia INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zeros. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
x REAL for mkl_cspblas_scsrsymv. DOUBLE PRECISION for mkl_cspblas_dcsrsymv. COMPLEX for mkl_cspblas_ccsrsymv. DOUBLE COMPLEX for mkl_cspblas_zcsrsymv. Array, DIMENSION is m. On entry, the array x must contain the vector x.
Output Parameters
y REAL for mkl_cspblas_scsrsymv. DOUBLE PRECISION for mkl_cspblas_dcsrsymv. COMPLEX for mkl_cspblas_ccsrsymv. DOUBLE COMPLEX for mkl_cspblas_zcsrsymv. Array, DIMENSION at least m. On exit, the array y must contain the vector y.
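Because only one triangle of the symmetric matrix is stored, each off-diagonal element contributes to two entries of y. The plain C sketch below illustrates this for uplo = 'U' with zero-based CSR arrays. The function name csr0_symv_upper is hypothetical, and the code is a reference illustration of y := A*x rather than Intel MKL's implementation.

```c
#include <assert.h>

/* Hypothetical reference helper (not an MKL routine).
   Computes y := A*x for a symmetric matrix whose upper triangle is stored
   in zero-based CSR (3-array variation). Any stored element below the
   diagonal is ignored. */
void csr0_symv_upper(int m, const double *a, const int *ia, const int *ja,
                     const double *x, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;
    for (int i = 0; i < m; ++i) {
        for (int k = ia[i]; k < ia[i + 1]; ++k) {
            const int j = ja[k];
            if (j < i) continue;         /* not part of the upper triangle */
            y[i] += a[k] * x[j];         /* contribution of A(i,j)         */
            if (j > i)
                y[j] += a[k] * x[i];     /* mirrored element A(j,i)        */
        }
    }
}
```

For the 2-by-2 matrix with rows (2, 1) and (1, 3), the stored upper triangle is a = (2, 1, 3), ia = (0, 2, 3), ja = (0, 1, 1); note that the last element of ia equals the number of non-zeros, as stated above.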
Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scsrsymv(uplo, m, a, ia, ja, x, y)
  CHARACTER*1 uplo
  INTEGER m
  INTEGER ia(*), ja(*)
  REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcsrsymv(uplo, m, a, ia, ja, x, y)
  CHARACTER*1 uplo
  INTEGER m
  INTEGER ia(*), ja(*)
  DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccsrsymv(uplo, m, a, ia, ja, x, y)
  CHARACTER*1 uplo
  INTEGER m
  INTEGER ia(*), ja(*)
  COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcsrsymv(uplo, m, a, ia, ja, x, y)
  CHARACTER*1 uplo
  INTEGER m
  INTEGER ia(*), ja(*)
  DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_scsrsymv(char *uplo, int *m, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dcsrsymv(char *uplo, int *m, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_ccsrsymv(char *uplo, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcsrsymv(char *uplo, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?bsrsymv
Computes matrix-vector product of a sparse symmetrical matrix stored in the BSR format (3-array variation) with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_sbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_cspblas_dbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_cspblas_cbsrsymv(uplo, m, lb, a, ia, ja, x, y)
call mkl_cspblas_zbsrsymv(uplo, m, lb, a, ia, ja, x, y)
C:
mkl_cspblas_sbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_dbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_cbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_zbsrsymv(&uplo, &m, &lb, a, ia, ja, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_cspblas_?bsrsymv routine performs a matrix-vector operation defined as
y := A*x
where: x and y are vectors, A is an upper or lower triangle of the symmetrical sparse matrix in the BSR format (3-array variation) with zero-based indexing.
NOTE This routine supports only zero-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
m INTEGER. Number of block rows of the matrix A.
lb INTEGER. Size of the block in the matrix A.
a REAL for mkl_cspblas_sbsrsymv. DOUBLE PRECISION for mkl_cspblas_dbsrsymv. COMPLEX for mkl_cspblas_cbsrsymv. DOUBLE COMPLEX for mkl_cspblas_zbsrsymv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
ia INTEGER. Array of length (m + 1), containing indices of block in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. With zero-based indexing, the value of the last element ia(m + 1) is equal to the number of non-zero blocks. Refer to rowIndex array description in BSR Format for more details.
ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.
x REAL for mkl_cspblas_sbsrsymv. DOUBLE PRECISION for mkl_cspblas_dbsrsymv. COMPLEX for mkl_cspblas_cbsrsymv. DOUBLE COMPLEX for mkl_cspblas_zbsrsymv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.
Output Parameters
y REAL for mkl_cspblas_sbsrsymv. DOUBLE PRECISION for mkl_cspblas_dbsrsymv. COMPLEX for mkl_cspblas_cbsrsymv.
DOUBLE COMPLEX for mkl_cspblas_zbsrsymv. Array, DIMENSION at least (m*lb). On exit, the array y must contain the vector y.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_sbsrsymv(uplo, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  REAL a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dbsrsymv(uplo, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  DOUBLE PRECISION a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_cbsrsymv(uplo, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  COMPLEX a(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zbsrsymv(uplo, m, lb, a, ia, ja, x, y)
  CHARACTER*1 uplo
  INTEGER m, lb
  INTEGER ia(*), ja(*)
  DOUBLE COMPLEX a(*), x(*), y(*)
C:
void mkl_cspblas_sbsrsymv(char *uplo, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y);
void mkl_cspblas_dbsrsymv(char *uplo, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y);
void mkl_cspblas_cbsrsymv(char *uplo, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zbsrsymv(char *uplo, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_cspblas_?coosymv
Computes matrix-vector product of a sparse symmetrical matrix stored in the coordinate format with zero-based indexing.
Syntax Fortran: call mkl_cspblas_scoosymv(uplo, m, val, rowind, colind, nnz, x, y) call mkl_cspblas_dcoosymv(uplo, m, val, rowind, colind, nnz, x, y) call mkl_cspblas_ccoosymv(uplo, m, val, rowind, colind, nnz, x, y) call mkl_cspblas_zcoosymv(uplo, m, val, rowind, colind, nnz, x, y) C: mkl_cspblas_scoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y); mkl_cspblas_dcoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y); 2 Intel® Math Kernel Library Reference Manual 204 mkl_cspblas_ccoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y); mkl_cspblas_zcoosymv(&uplo, &m, val, rowind, colind, &nnz, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_cspblas_?coosymv routine performs a matrix-vector operation defined as y := A*x where: x and y are vectors, A is an upper or lower triangle of the symmetrical sparse matrix in the coordinate format with zero-based indexing. NOTE This routine supports only zero-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. uplo CHARACTER*1. Specifies whether the upper or low triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the low triangle of the matrix A is used. m INTEGER. Number of rows of the matrix A. val REAL for mkl_cspblas_scoosymv. DOUBLE PRECISION for mkl_cspblas_dcoosymv. COMPLEX for mkl_cspblas_ccoosymv. DOUBLE COMPLEX for mkl_cspblas_zcoosymv. Array of length nnz, contains non-zero elements of the matrix A in the arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. 
colind INTEGER. Array of length nnz, contains the column indices for each nonzero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero element of the matrix A. Refer to nnz description in Coordinate Format for more details. x REAL for mkl_cspblas_scoosymv. DOUBLE PRECISION for mkl_cspblas_dcoosymv. COMPLEX for mkl_cspblas_ccoosymv. DOUBLE COMPLEX for mkl_cspblas_zcoosymv. Array, DIMENSION is m. On entry, the array x must contain the vector x. BLAS and Sparse BLAS Routines 2 205 Output Parameters y REAL for mkl_cspblas_scoosymv. DOUBLE PRECISION for mkl_cspblas_dcoosymv. COMPLEX for mkl_cspblas_ccoosymv. DOUBLE COMPLEX for mkl_cspblas_zcoosymv. Array, DIMENSION at least m. On exit, the array y must contain the vector y. Interfaces FORTRAN 77: SUBROUTINE mkl_cspblas_scoosymv(uplo, m, val, rowind, colind, nnz, x, y) CHARACTER*1 uplo INTEGER m, nnz INTEGER rowind(*), colind(*) REAL val(*), x(*), y(*) SUBROUTINE mkl_cspblas_dcoosymv(uplo, m, val, rowind, colind, nnz, x, y) CHARACTER*1 uplo INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION val(*), x(*), y(*) SUBROUTINE mkl_cspblas_ccoosymv(uplo, m, val, rowind, colind, nnz, x, y) CHARACTER*1 uplo INTEGER m, nnz INTEGER rowind(*), colind(*) COMPLEX val(*), x(*), y(*) SUBROUTINE mkl_cspblas_zcoosymv(uplo, m, val, rowind, colind, nnz, x, y) CHARACTER*1 uplo INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE COMPLEX val(*), x(*), y(*) C: void mkl_cspblas_scoosymv(char *uplo, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y); void mkl_cspblas_dcoosymv(char *uplo, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y); void mkl_cspblas_ccoosymv(char *uplo, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_cspblas_zcoosymv(char *uplo, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, 
MKL_Complex16 *y);
mkl_cspblas_?csrtrsv
Triangular solvers with simplified interface for a sparse matrix in the CSR format (3-array variation) with zero-based indexing.
Syntax
Fortran:
call mkl_cspblas_scsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_cspblas_dcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_cspblas_ccsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
call mkl_cspblas_zcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y)
C:
mkl_cspblas_scsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_cspblas_dcsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_cspblas_ccsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
mkl_cspblas_zcsrtrsv(&uplo, &transa, &diag, &m, a, ia, ja, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_cspblas_?csrtrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the CSR format (3-array variation) with zero-based indexing: A*y = x or A'*y = x, where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports only zero-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1.
Specifies whether matrix A is unit triangular. If diag = 'U' or 'u', then A is unit triangular. If diag = 'N' or 'n', then A is not unit triangular. m INTEGER. Number of rows of the matrix A. a REAL for mkl_cspblas_scsrtrsv. DOUBLE PRECISION for mkl_cspblas_dcsrtrsv. COMPLEX for mkl_cspblas_ccsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zcsrtrsv. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. ia INTEGER. Array of length m+1, containing indices of elements in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m) is equal to the number of non-zeros. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. ja INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details. NOTE Column indices must be sorted in increasing order for each row. x REAL for mkl_cspblas_scsrtrsv. DOUBLE PRECISION for mkl_cspblas_dcsrtrsv. COMPLEX for mkl_cspblas_ccsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zcsrtrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_cspblas_scsrtrsv. DOUBLE PRECISION for mkl_cspblas_dcsrtrsv. COMPLEX for mkl_cspblas_ccsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zcsrtrsv. Array, DIMENSION at least m. Contains the vector y. 
2 Intel® Math Kernel Library Reference Manual 208 Interfaces FORTRAN 77: SUBROUTINE mkl_cspblas_scsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m INTEGER ia(*), ja(*) REAL a(*), x(*), y(*) SUBROUTINE mkl_cspblas_dcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m INTEGER ia(*), ja(*) DOUBLE PRECISION a(*), x(*), y(*) SUBROUTINE mkl_cspblas_ccsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m INTEGER ia(*), ja(*) COMPLEX a(*), x(*), y(*) SUBROUTINE mkl_cspblas_zcsrtrsv(uplo, transa, diag, m, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m INTEGER ia(*), ja(*) DOUBLE COMPLEX a(*), x(*), y(*) C: void mkl_cspblas_scsrtrsv(char *uplo, char *transa, char *diag, int *m, float *a, int *ia, int *ja, float *x, float *y); void mkl_cspblas_dcsrtrsv(char *uplo, char *transa, char *diag, int *m, double *a, int *ia, int *ja, double *x, double *y); void mkl_cspblas_ccsrtrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_cspblas_zcsrtrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y); mkl_cspblas_?bsrtrsv Triangular solver with simplified interface for a sparse matrix stored in the BSR format (3-array variation) with zero-based indexing. 
Syntax
Fortran:
call mkl_cspblas_sbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_cspblas_dbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_cspblas_cbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
call mkl_cspblas_zbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y)
C:
mkl_cspblas_sbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_dbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_cbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
mkl_cspblas_zbsrtrsv(&uplo, &transa, &diag, &m, &lb, a, ia, ja, x, y);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_cspblas_?bsrtrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the BSR format (3-array variation) with zero-based indexing: A*y = x or A'*y = x, where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports only zero-based indexing of the input arrays.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
uplo CHARACTER*1. Specifies whether the upper or lower triangle of the matrix A is used. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the lower triangle of the matrix A is used.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x. If transa = 'T' or 't' or 'C' or 'c', then A'*y = x.
diag CHARACTER*1. Specifies whether matrix A is unit triangular or not. If diag = 'U' or 'u', A is unit triangular.
If diag = 'N' or 'n', A is not unit triangular.
m INTEGER. Number of block rows of the matrix A.
lb INTEGER. Size of the block in the matrix A.
a REAL for mkl_cspblas_sbsrtrsv. DOUBLE PRECISION for mkl_cspblas_dbsrtrsv. COMPLEX for mkl_cspblas_cbsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zbsrtrsv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
ia INTEGER. Array of length (m + 1), containing indices of blocks in the array a, such that ia(i) is the index in the array a of the first non-zero element from the row i. The value of the last element ia(m + 1) is equal to the number of non-zero blocks. Refer to rowIndex array description in BSR Format for more details.
ja INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details.
x REAL for mkl_cspblas_sbsrtrsv. DOUBLE PRECISION for mkl_cspblas_dbsrtrsv. COMPLEX for mkl_cspblas_cbsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zbsrtrsv. Array, DIMENSION (m*lb). On entry, the array x must contain the vector x.
Output Parameters
y REAL for mkl_cspblas_sbsrtrsv. DOUBLE PRECISION for mkl_cspblas_dbsrtrsv. COMPLEX for mkl_cspblas_cbsrtrsv. DOUBLE COMPLEX for mkl_cspblas_zbsrtrsv. Array, DIMENSION at least (m*lb). On exit, the array y contains the vector y.
Interfaces FORTRAN 77: SUBROUTINE mkl_cspblas_sbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m, lb INTEGER ia(*), ja(*) REAL a(*), x(*), y(*) BLAS and Sparse BLAS Routines 2 211 SUBROUTINE mkl_cspblas_dbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m, lb INTEGER ia(*), ja(*) DOUBLE PRECISION a(*), x(*), y(*) SUBROUTINE mkl_cspblas_cbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m, lb INTEGER ia(*), ja(*) COMPLEX a(*), x(*), y(*) SUBROUTINE mkl_cspblas_zbsrtrsv(uplo, transa, diag, m, lb, a, ia, ja, x, y) CHARACTER*1 uplo, transa, diag INTEGER m, lb INTEGER ia(*), ja(*) DOUBLE COMPLEX a(*), x(*), y(*) C: void mkl_cspblas_sbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, float *a, int *ia, int *ja, float *x, float *y); void mkl_cspblas_dbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, double *a, int *ia, int *ja, double *x, double *y); void mkl_cspblas_cbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, MKL_Complex8 *a, int *ia, int *ja, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_cspblas_zbsrtrsv(char *uplo, char *transa, char *diag, int *m, int *lb, MKL_Complex16 *a, int *ia, int *ja, MKL_Complex16 *x, MKL_Complex16 *y); mkl_cspblas_?cootrsv Triangular solvers with simplified interface for a sparse matrix in the coordinate format with zero-based indexing . 
Syntax Fortran: call mkl_cspblas_scootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y) call mkl_cspblas_dcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y) call mkl_cspblas_ccootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y) call mkl_cspblas_zcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y) C: mkl_cspblas_scootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y); mkl_cspblas_dcootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y); 2 Intel® Math Kernel Library Reference Manual 212 mkl_cspblas_ccootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y); mkl_cspblas_zcootrsv(&uplo, &transa, &diag, &m, val, rowind, colind, &nnz, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_cspblas_?cootrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the coordinate format with zero-based indexing: A*y = x or A'*y = x, where: x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports only zero-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. uplo CHARACTER*1. Specifies whether the upper or low triangle of the matrix A is considered. If uplo = 'U' or 'u', then the upper triangle of the matrix A is used. If uplo = 'L' or 'l', then the low triangle of the matrix A is used. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then A*y = x If transa = 'T' or 't' or 'C' or 'c', then A'*y = x, diag CHARACTER*1. Specifies whether A is unit triangular. If diag = 'U' or 'u', then A is unit triangular. 
If diag = 'N' or 'n', then A is not unit triangular. m INTEGER. Number of rows of the matrix A. val REAL for mkl_cspblas_scootrsv. DOUBLE PRECISION for mkl_cspblas_dcootrsv. COMPLEX for mkl_cspblas_ccootrsv. DOUBLE COMPLEX for mkl_cspblas_zcootrsv. Array of length nnz, contains non-zero elements of the matrix A in the arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. BLAS and Sparse BLAS Routines 2 213 colind INTEGER. Array of length nnz, contains the column indices for each nonzero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero element of the matrix A. Refer to nnz description in Coordinate Format for more details. x REAL for mkl_cspblas_scootrsv. DOUBLE PRECISION for mkl_cspblas_dcootrsv. COMPLEX for mkl_cspblas_ccootrsv. DOUBLE COMPLEX for mkl_cspblas_zcootrsv. Array, DIMENSION is m. On entry, the array x must contain the vector x. Output Parameters y REAL for mkl_cspblas_scootrsv. DOUBLE PRECISION for mkl_cspblas_dcootrsv. COMPLEX for mkl_cspblas_ccootrsv. DOUBLE COMPLEX for mkl_cspblas_zcootrsv. Array, DIMENSION at least m. Contains the vector y. 
Interfaces
FORTRAN 77:
SUBROUTINE mkl_cspblas_scootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 uplo, transa, diag
INTEGER m, nnz
INTEGER rowind(*), colind(*)
REAL val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_dcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 uplo, transa, diag
INTEGER m, nnz
INTEGER rowind(*), colind(*)
DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_ccootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 uplo, transa, diag
INTEGER m, nnz
INTEGER rowind(*), colind(*)
COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_cspblas_zcootrsv(uplo, transa, diag, m, val, rowind, colind, nnz, x, y)
CHARACTER*1 uplo, transa, diag
INTEGER m, nnz
INTEGER rowind(*), colind(*)
DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_cspblas_scootrsv(char *uplo, char *transa, char *diag, int *m, float *val, int *rowind, int *colind, int *nnz, float *x, float *y);
void mkl_cspblas_dcootrsv(char *uplo, char *transa, char *diag, int *m, double *val, int *rowind, int *colind, int *nnz, double *x, double *y);
void mkl_cspblas_ccootrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_cspblas_zcootrsv(char *uplo, char *transa, char *diag, int *m, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y);
mkl_?csrmv
Computes matrix-vector product of a sparse matrix stored in the CSR format.
Syntax Fortran: call mkl_scsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_dcsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_ccsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_zcsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) C: mkl_scsrmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_dcsrmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_ccsrmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_zcsrmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrmv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y, where: alpha and beta are scalars, x and y are vectors, A is an m-by-k sparse matrix in the CSR format, A' is the transpose of A. BLAS and Sparse BLAS Routines 2 215 NOTE This routine supports a CSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y, m INTEGER. Number of rows of the matrix A. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv. DOUBLE COMPLEX for mkl_zcsrmv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. 
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv. DOUBLE COMPLEX for mkl_zcsrmv. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSR Format for more details.
indx INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to columns array description in CSR Format for more details.
pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of row i in the arrays val and indx. Refer to pointerB array description in CSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of row i in the arrays val and indx. Refer to pointerE array description in CSR Format for more details.
x REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv. DOUBLE COMPLEX for mkl_zcsrmv. Array, DIMENSION at least k if transa = 'N' or 'n' and at least m otherwise. On entry, the array x must contain the vector x.
beta REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv.
DOUBLE COMPLEX for mkl_zcsrmv. Specifies the scalar beta.
y REAL for mkl_scsrmv. DOUBLE PRECISION for mkl_dcsrmv. COMPLEX for mkl_ccsrmv. DOUBLE COMPLEX for mkl_zcsrmv. Array, DIMENSION at least m if transa = 'N' or 'n' and at least k otherwise. On entry, the array y must contain the vector y.
Output Parameters
y Overwritten by the updated vector y.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_scsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, k
INTEGER indx(*), pntrb(m), pntre(m)
REAL alpha, beta
REAL val(*), x(*), y(*)
SUBROUTINE mkl_dcsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, k
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE PRECISION alpha, beta
DOUBLE PRECISION val(*), x(*), y(*)
SUBROUTINE mkl_ccsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, k
INTEGER indx(*), pntrb(m), pntre(m)
COMPLEX alpha, beta
COMPLEX val(*), x(*), y(*)
SUBROUTINE mkl_zcsrmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, k
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE COMPLEX alpha, beta
DOUBLE COMPLEX val(*), x(*), y(*)
C:
void mkl_scsrmv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *beta, float *y);
void mkl_dcsrmv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *beta, double *y);
void mkl_ccsrmv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y);
void mkl_zcsrmv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int
*pntre, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y); mkl_?bsrmv Computes matrix - vector product of a sparse matrix stored in the BSR format. Syntax Fortran: call mkl_sbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_dbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_cbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_zbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) C: mkl_sbsrmv(&transa, &m, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_dbsrmv(&transa, &m, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_cbsrmv(&transa, &m, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_zbsrmv(&transa, &m, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); 2 Intel® Math Kernel Library Reference Manual 218 Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?bsrmv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y, where: alpha and beta are scalars, x and y are vectors, A is an m-by-k block sparse matrix in the BSR format, A' is the transpose of A. NOTE This routine supports a BSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-vector product is computed as y := alpha*A*x + beta*y If transa = 'T' or 't' or 'C' or 'c', then the matrix-vector product is computed as y := alpha*A'*x + beta*y, m INTEGER. Number of block rows of the matrix A. k INTEGER. Number of block columns of the matrix A. lb INTEGER. 
Size of the block in the matrix A.
alpha REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv. COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv. COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to values array description in BSR Format for more details.
indx INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks in the matrix A. Refer to columns array description in BSR Format for more details.
pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of block row i in the array indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of block row i in the array indx. Refer to pointerB array description in BSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of block row i in the array indx. For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of block row i in the array indx. Refer to pointerE array description in BSR Format for more details.
x REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv.
COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Array, DIMENSION at least (k*lb) if transa = 'N' or 'n', and at least (m*lb) otherwise. On entry, the array x must contain the vector x. beta REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv. COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Specifies the scalar beta. y REAL for mkl_sbsrmv. DOUBLE PRECISION for mkl_dbsrmv. COMPLEX for mkl_cbsrmv. DOUBLE COMPLEX for mkl_zbsrmv. Array, DIMENSION at least (m*lb) if transa = 'N' or 'n', and at least (k*lb) otherwise. On entry, the array y must contain the vector y. Output Parameters y Overwritten by the updated vector y. 2 Intel® Math Kernel Library Reference Manual 220 Interfaces FORTRAN 77: SUBROUTINE mkl_sbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, lb INTEGER indx(*), pntrb(m), pntre(m) REAL alpha, beta REAL val(*), x(*), y(*) SUBROUTINE mkl_dbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, lb INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), x(*), y(*) SUBROUTINE mkl_cbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, lb INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha, beta COMPLEX val(*), x(*), y(*) SUBROUTINE mkl_zbsrmv(transa, m, k, lb, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, lb INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(*), x(*), y(*) C: void mkl_sbsrmv(char *transa, int *m, int *k, int *lb, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *beta, float *y); BLAS and Sparse BLAS Routines 2 221 void mkl_dbsrmv(char *transa, int *m, int *k, int *lb, double *alpha, char *matdescra, double *val, int 
*indx, int *pntrb, int *pntre, double *x, double *beta, double *y); void mkl_cbsrmv(char *transa, int *m, int *k, int *lb, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y); void mkl_zbsrmv(char *transa, int *m, int *k, int *lb, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y); mkl_?cscmv Computes matrix-vector product for a sparse matrix in the CSC format. Syntax Fortran: call mkl_scscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_dcscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_ccscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) call mkl_zcscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) C: mkl_scscmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_dcscmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_ccscmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); mkl_zcscmv(&transa, &m, &k, &alpha, matdescra, val, indx, pntrb, pntre, x, &beta, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?cscmv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y, where: alpha and beta are scalars, x and y are vectors, A is an m-by-k sparse matrix in compressed sparse column (CSC) format, A' is the transpose of A. 2 Intel® Math Kernel Library Reference Manual 222 NOTE This routine supports CSC format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. 
Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y. m INTEGER. Number of rows of the matrix A. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(k) - pntrb(1). For zero-based indexing its length is pntre(k-1) - pntrb(0). Refer to values array description in CSC Format for more details. indx INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to rows array description in CSC Format for more details. pntrb INTEGER. Array of length k. For one-based indexing this array contains column indices, such that pntrb(i) - pntrb(1)+1 is the first index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntrb(i) - pntrb(0) is the first index of column i in the arrays val and indx. Refer to pointerB array description in CSC Format for more details. pntre INTEGER. Array of length k.
For one-based indexing this array contains column indices, such that pntre(i) - pntrb(1) is the last index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntre(i) - pntrb(0)-1 is the last index of column i in the arrays val and indx. Refer to pointerE array description in CSC Format for more details. x REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Array, DIMENSION at least k if transa = 'N' or 'n' and at least m otherwise. On entry, the array x must contain the vector x. beta REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Specifies the scalar beta. y REAL for mkl_scscmv. DOUBLE PRECISION for mkl_dcscmv. COMPLEX for mkl_ccscmv. DOUBLE COMPLEX for mkl_zcscmv. Array, DIMENSION at least m if transa = 'N' or 'n' and at least k otherwise. On entry, the array y must contain the vector y. Output Parameters y Overwritten by the updated vector y.
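The way pntrb, pntre, and indx drive the CSC product can be illustrated with a small reference loop. The sketch below is not the MKL implementation: csc_mv_ref is a hypothetical name, it handles only double-precision real data with zero-based indexing, and it treats A as a general matrix (the matdescra options are ignored).

```c
#include <assert.h>

/* Illustrative reference only: csc_mv_ref is a hypothetical helper, not an
 * MKL routine. It computes
 *   y := alpha*A*x  + beta*y   when transa is 'N' or 'n',
 *   y := alpha*A'*x + beta*y   otherwise,
 * for a general real m-by-k matrix A in zero-based CSC storage. */
void csc_mv_ref(char transa, int m, int k, double alpha,
                const double *val, const int *indx,
                const int *pntrb, const int *pntre,
                const double *x, double beta, double *y)
{
    assert(m >= 0 && k >= 0);
    int notrans = (transa == 'N' || transa == 'n');
    int ylen = notrans ? m : k;            /* y has m entries for A, k for A' */
    int base = (k > 0) ? pntrb[0] : 0;     /* offsets are relative to pntrb(0) */

    /* Scale y by beta; beta == 0 overwrites y instead of scaling it. */
    for (int i = 0; i < ylen; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* Column j occupies positions pntrb[j]-base .. pntre[j]-base-1. */
    for (int j = 0; j < k; ++j) {
        for (int p = pntrb[j] - base; p < pntre[j] - base; ++p) {
            int i = indx[p];               /* row index of val[p] */
            if (notrans)
                y[i] += alpha * val[p] * x[j];
            else
                y[j] += alpha * val[p] * x[i];
        }
    }
}
```

For the 3-by-2 matrix with rows (1 0), (0 2), (3 4), stored as val = {1, 3, 2, 4}, indx = {0, 2, 1, 2}, pntrb = {0, 2}, pntre = {2, 4}, a call with transa = 'N', alpha = 2, beta = 1, x = {1, 1} and y = {1, 1, 1} leaves y = {3, 5, 15}.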
Interfaces FORTRAN 77: SUBROUTINE mkl_scscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k INTEGER indx(*), pntrb(k), pntre(k) REAL alpha, beta REAL val(*), x(*), y(*) SUBROUTINE mkl_dcscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k INTEGER indx(*), pntrb(k), pntre(k) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), x(*), y(*) SUBROUTINE mkl_ccscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k INTEGER indx(*), pntrb(k), pntre(k) COMPLEX alpha, beta COMPLEX val(*), x(*), y(*) SUBROUTINE mkl_zcscmv(transa, m, k, alpha, matdescra, val, indx, pntrb, pntre, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k INTEGER indx(*), pntrb(k), pntre(k) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(*), x(*), y(*) C: void mkl_scscmv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *beta, float *y); void mkl_dcscmv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *beta, double *y); void mkl_ccscmv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y); void mkl_zcscmv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y); mkl_?coomv Computes matrix-vector product for a sparse matrix in the coordinate format.
Syntax Fortran: call mkl_scoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) call mkl_dcoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) call mkl_ccoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) call mkl_zcoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) C: mkl_scoomv(&transa, &m, &k, &alpha, matdescra, val, rowind, colind, &nnz, x, &beta, y); mkl_dcoomv(&transa, &m, &k, &alpha, matdescra, val, rowind, colind, &nnz, x, &beta, y); mkl_ccoomv(&transa, &m, &k, &alpha, matdescra, val, rowind, colind, &nnz, x, &beta, y); mkl_zcoomv(&transa, &m, &k, &alpha, matdescra, val, rowind, colind, &nnz, x, &beta, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?coomv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y, where: alpha and beta are scalars, x and y are vectors, A is an m-by-k sparse matrix in the coordinate format, A' is the transpose of A. NOTE This routine supports a coordinate format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y. m INTEGER. Number of rows of the matrix A. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details. x REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Array, DIMENSION at least k if transa = 'N' or 'n' and at least m otherwise. On entry, the array x must contain the vector x. beta REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Specifies the scalar beta. y REAL for mkl_scoomv. DOUBLE PRECISION for mkl_dcoomv. COMPLEX for mkl_ccoomv. DOUBLE COMPLEX for mkl_zcoomv. Array, DIMENSION at least m if transa = 'N' or 'n' and at least k otherwise. On entry, the array y must contain the vector y. Output Parameters y Overwritten by the updated vector y.
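Because the coordinate format stores each non-zero element with its own row and column index, the product reduces to a single pass over the nnz entries in whatever order they are stored. The following sketch shows that pass under stated assumptions: coo_mv_ref is a hypothetical name (not an MKL routine), data is double-precision real, indexing is zero-based, and A is treated as a general matrix regardless of matdescra.

```c
#include <assert.h>

/* Illustrative reference only: coo_mv_ref is a hypothetical helper, not an
 * MKL routine. It computes
 *   y := alpha*A*x  + beta*y   when transa is 'N' or 'n',
 *   y := alpha*A'*x + beta*y   otherwise,
 * for a general real m-by-k matrix A in zero-based coordinate storage. */
void coo_mv_ref(char transa, int m, int k, double alpha,
                const double *val, const int *rowind, const int *colind,
                int nnz, const double *x, double beta, double *y)
{
    assert(m >= 0 && k >= 0 && nnz >= 0);
    int notrans = (transa == 'N' || transa == 'n');
    int ylen = notrans ? m : k;

    /* Scale y by beta; beta == 0 overwrites y instead of scaling it. */
    for (int i = 0; i < ylen; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* Entries may appear in any order; each one contributes independently. */
    for (int p = 0; p < nnz; ++p) {
        if (notrans)
            y[rowind[p]] += alpha * val[p] * x[colind[p]];
        else
            y[colind[p]] += alpha * val[p] * x[rowind[p]];
    }
}
```

For the 3-by-2 matrix with rows (1 0), (0 2), (3 4), stored in shuffled order as val = {4, 1, 2, 3}, rowind = {2, 0, 1, 2}, colind = {1, 0, 1, 0} with nnz = 4, a call with transa = 'N', alpha = 2, beta = 1, x = {1, 1} and y = {1, 1, 1} leaves y = {3, 5, 15}.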
Interfaces FORTRAN 77: SUBROUTINE mkl_scoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, nnz INTEGER rowind(*), colind(*) REAL alpha, beta REAL val(*), x(*), y(*) SUBROUTINE mkl_dcoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), x(*), y(*) SUBROUTINE mkl_ccoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, nnz INTEGER rowind(*), colind(*) COMPLEX alpha, beta COMPLEX val(*), x(*), y(*) SUBROUTINE mkl_zcoomv(transa, m, k, alpha, matdescra, val, rowind, colind, nnz, x, beta, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, k, nnz INTEGER rowind(*), colind(*) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(*), x(*), y(*) C: void mkl_scoomv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *rowind, int *colind, int *nnz, float *x, float *beta, float *y); void mkl_dcoomv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *rowind, int *colind, int *nnz, double *x, double *beta, double *y); void mkl_ccoomv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y); void mkl_zcoomv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y); mkl_?csrsv Solves a system of linear equations for a sparse matrix in the CSR format.
Syntax Fortran: call mkl_scsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_dcsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_ccsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_zcsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) C: mkl_scsrsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_dcsrsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_ccsrsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_zcsrsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the CSR format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is a scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports a CSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then y := alpha*inv(A)*x. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x. m INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scsrsv. DOUBLE PRECISION for mkl_dcsrsv. COMPLEX for mkl_ccsrsv. DOUBLE COMPLEX for mkl_zcsrsv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scsrsv. DOUBLE PRECISION for mkl_dcsrsv. COMPLEX for mkl_ccsrsv. DOUBLE COMPLEX for mkl_zcsrsv. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSR Format for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. indx INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to columns array description in CSR Format for more details. NOTE Column indices must be sorted in increasing order for each row. pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of row i in the arrays val and indx. Refer to pointerB array description in CSR Format for more details. pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of row i in the arrays val and indx. Refer to pointerE array description in CSR Format for more details. x REAL for mkl_scsrsv.
DOUBLE PRECISION for mkl_dcsrsv. COMPLEX for mkl_ccsrsv. DOUBLE COMPLEX for mkl_zcsrsv. Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment. y REAL for mkl_scsrsv. DOUBLE PRECISION for mkl_dcsrsv. COMPLEX for mkl_ccsrsv. DOUBLE COMPLEX for mkl_zcsrsv. Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment. Output Parameters y Contains solution vector x. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*) REAL x(*), y(*) SUBROUTINE mkl_dcsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*) DOUBLE PRECISION x(*), y(*) SUBROUTINE mkl_ccsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*) COMPLEX x(*), y(*) SUBROUTINE mkl_zcsrsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*) DOUBLE COMPLEX x(*), y(*) C: void mkl_scsrsv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *y); void mkl_dcsrsv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *y); void mkl_ccsrsv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcsrsv(char *transa, int
*m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?bsrsv Solves a system of linear equations for a sparse matrix in the BSR format. Syntax Fortran: call mkl_sbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_dbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_cbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_zbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) C: mkl_sbsrsv(&transa, &m, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_dbsrsv(&transa, &m, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_cbsrsv(&transa, &m, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_zbsrsv(&transa, &m, &lb, &alpha, matdescra, val, indx, pntrb, pntre, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?bsrsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the BSR format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is a scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports a BSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*inv(A)*x. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x. m INTEGER. Number of block columns of the matrix A. lb INTEGER. Size of the block in the matrix A. alpha REAL for mkl_sbsrsv.
DOUBLE PRECISION for mkl_dbsrsv. COMPLEX for mkl_cbsrsv. DOUBLE COMPLEX for mkl_zbsrsv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_sbsrsv. DOUBLE PRECISION for mkl_dbsrsv. COMPLEX for mkl_cbsrsv. DOUBLE COMPLEX for mkl_zbsrsv. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to the values array description in BSR Format for more details. NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. indx INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks in the matrix A. Refer to the columns array description in BSR Format for more details. pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of block row i in the array indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of block row i in the array indx. Refer to pointerB array description in BSR Format for more details. pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of block row i in the array indx.
For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of block row i in the array indx. Refer to pointerE array description in BSR Format for more details. x REAL for mkl_sbsrsv. DOUBLE PRECISION for mkl_dbsrsv. COMPLEX for mkl_cbsrsv. DOUBLE COMPLEX for mkl_zbsrsv. Array, DIMENSION at least (m*lb). On entry, the array x must contain the vector x. The elements are accessed with unit increment. y REAL for mkl_sbsrsv. DOUBLE PRECISION for mkl_dbsrsv. COMPLEX for mkl_cbsrsv. DOUBLE COMPLEX for mkl_zbsrsv. Array, DIMENSION at least (m*lb). On entry, the array y must contain the vector y. The elements are accessed with unit increment. Output Parameters y Contains solution vector x. Interfaces FORTRAN 77: SUBROUTINE mkl_sbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, lb INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*) REAL x(*), y(*) SUBROUTINE mkl_dbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, lb INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*) DOUBLE PRECISION x(*), y(*) SUBROUTINE mkl_cbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, lb INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*) COMPLEX x(*), y(*) SUBROUTINE mkl_zbsrsv(transa, m, lb, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, lb INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*) DOUBLE COMPLEX x(*), y(*) C: void mkl_sbsrsv(char *transa, int *m, int *lb, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *y); void mkl_dbsrsv(char *transa, int *m, int *lb, double
*alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *y); void mkl_cbsrsv(char *transa, int *m, int *lb, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zbsrsv(char *transa, int *m, int *lb, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?cscsv Solves a system of linear equations for a sparse matrix in the CSC format. Syntax Fortran: call mkl_scscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_dcscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_ccscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) call mkl_zcscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) C: mkl_scscsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_dcscsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_ccscsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); mkl_zcscsv(&transa, &m, &alpha, matdescra, val, indx, pntrb, pntre, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?cscsv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the CSC format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is a scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports a CSC format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types.
Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*inv(A)*x. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x. m INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scscsv. DOUBLE PRECISION for mkl_dcscsv. COMPLEX for mkl_ccscsv. DOUBLE COMPLEX for mkl_zcscsv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scscsv. DOUBLE PRECISION for mkl_dcscsv. COMPLEX for mkl_ccscsv. DOUBLE COMPLEX for mkl_zcscsv. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSC Format for more details. NOTE The non-zero elements of the given column of the matrix must be stored in the same order as they appear in the column (from top to bottom). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator. indx INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to rows array description in CSC Format for more details. NOTE Row indices must be sorted in increasing order for each column. pntrb INTEGER. Array of length m. For one-based indexing this array contains column indices, such that pntrb(i) - pntrb(1)+1 is the first index of column i in the arrays val and indx.
For zero-based indexing this array contains column indices, such that pntrb(i) - pntrb(0) is the first index of column i in the arrays val and indx. Refer to pointerB array description in CSC Format for more details. pntre INTEGER. Array of length m. For one-based indexing this array contains column indices, such that pntre(i) - pntrb(1) is the last index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntre(i) - pntrb(0)-1 is the last index of column i in the arrays val and indx. Refer to pointerE array description in CSC Format for more details. x REAL for mkl_scscsv. DOUBLE PRECISION for mkl_dcscsv. COMPLEX for mkl_ccscsv. DOUBLE COMPLEX for mkl_zcscsv. Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment. y REAL for mkl_scscsv. DOUBLE PRECISION for mkl_dcscsv. COMPLEX for mkl_ccscsv. DOUBLE COMPLEX for mkl_zcscsv. Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment. Output Parameters y Contains the solution vector x.
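The requirements that row indices be sorted within each column and that the diagonal be stored explicitly are what make a column-oriented substitution possible. The sketch below illustrates this for one case only: a lower triangular matrix with non-unit diagonal, no transposition, double-precision real data, and zero-based indexing. The name csc_lower_solve_ref and its integer return code are assumptions made for illustration; this is not the MKL implementation.

```c
#include <assert.h>

/* Illustrative reference only: csc_lower_solve_ref is a hypothetical helper,
 * not an MKL routine. It computes y := alpha*inv(A)*x for a real m-by-m lower
 * triangular matrix A with non-unit diagonal in zero-based CSC storage.
 * Row indices are assumed sorted within each column, so the diagonal entry
 * is the first stored entry of its column.
 * Returns 0 on success, -1 if a diagonal entry is missing or zero. */
int csc_lower_solve_ref(int m, double alpha,
                        const double *val, const int *indx,
                        const int *pntrb, const int *pntre,
                        const double *x, double *y)
{
    assert(m >= 0);
    int base = (m > 0) ? pntrb[0] : 0;

    /* Start from the scaled right-hand side. */
    for (int i = 0; i < m; ++i)
        y[i] = alpha * x[i];

    /* Forward substitution, one column at a time. */
    for (int j = 0; j < m; ++j) {
        int first = pntrb[j] - base;
        int end = pntre[j] - base;
        if (first >= end || indx[first] != j || val[first] == 0.0)
            return -1;
        y[j] /= val[first];                  /* unknown j is now final */
        for (int p = first + 1; p < end; ++p)
            y[indx[p]] -= val[p] * y[j];     /* eliminate it from later rows */
    }
    return 0;
}
```

For the lower triangular matrix with rows (2 0 0), (1 4 0), (0 3 5), stored as val = {2, 1, 4, 3, 5}, indx = {0, 1, 1, 2, 2}, pntrb = {0, 2, 4}, pntre = {2, 4, 5}, a call with alpha = 1 and x = {2, 9, 21} returns 0 and leaves y = {1, 2, 3}.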
Interfaces FORTRAN 77: SUBROUTINE mkl_scscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) REAL alpha REAL val(*) REAL x(*), y(*) SUBROUTINE mkl_dcscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) DOUBLE PRECISION alpha DOUBLE PRECISION val(*) DOUBLE PRECISION x(*), y(*) SUBROUTINE mkl_ccscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) COMPLEX alpha COMPLEX val(*) COMPLEX x(*), y(*) SUBROUTINE mkl_zcscsv(transa, m, alpha, matdescra, val, indx, pntrb, pntre, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m INTEGER indx(*), pntrb(m), pntre(m) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*) DOUBLE COMPLEX x(*), y(*) C: void mkl_scscsv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *x, float *y); void mkl_dcscsv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *x, double *y); void mkl_ccscsv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcscsv(char *transa, int *m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?coosv Solves a system of linear equations for a sparse matrix in the coordinate format.
Syntax Fortran: call mkl_scoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) call mkl_dcoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) call mkl_ccoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) call mkl_zcoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) C: mkl_scoosv(&transa, &m, &alpha, matdescra, val, rowind, colind, &nnz, x, y); mkl_dcoosv(&transa, &m, &alpha, matdescra, val, rowind, colind, &nnz, x, y); mkl_ccoosv(&transa, &m, &alpha, matdescra, val, rowind, colind, &nnz, x, y); mkl_zcoosv(&transa, &m, &alpha, matdescra, val, rowind, colind, &nnz, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?coosv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the coordinate format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is a scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports a coordinate format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then y := alpha*inv(A)*x. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x. m INTEGER. Number of rows of the matrix A. alpha REAL for mkl_scoosv. DOUBLE PRECISION for mkl_dcoosv. COMPLEX for mkl_ccoosv. DOUBLE COMPLEX for mkl_zcoosv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scoosv. DOUBLE PRECISION for mkl_dcoosv. COMPLEX for mkl_ccoosv. DOUBLE COMPLEX for mkl_zcoosv. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details. x REAL for mkl_scoosv. DOUBLE PRECISION for mkl_dcoosv. COMPLEX for mkl_ccoosv. DOUBLE COMPLEX for mkl_zcoosv. Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment. y REAL for mkl_scoosv. DOUBLE PRECISION for mkl_dcoosv. COMPLEX for mkl_ccoosv. DOUBLE COMPLEX for mkl_zcoosv. Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment. Output Parameters y Contains solution vector x.
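Since coordinate storage places no ordering requirement on the entries, a triangular solve cannot simply walk the arrays once. The sketch below shows the simplest correct approach, rescanning all nnz entries for every row, which costs O(m*nnz) and is meant only to make the semantics of y := alpha*inv(A)*x concrete. It assumes a lower triangular matrix with non-unit diagonal, no transposition, double-precision real data, and zero-based indexing; coo_lower_solve_ref and its return code are hypothetical, and a production routine would be expected to use a far more efficient scheme.

```c
#include <assert.h>

/* Illustrative reference only: coo_lower_solve_ref is a hypothetical helper,
 * not an MKL routine. It computes y := alpha*inv(A)*x for a real m-by-m lower
 * triangular matrix A with non-unit diagonal in zero-based coordinate storage.
 * Entries may appear in any order; entries above the diagonal are ignored.
 * Returns 0 on success, -1 if a diagonal entry is missing or zero. */
int coo_lower_solve_ref(int m, double alpha,
                        const double *val, const int *rowind,
                        const int *colind, int nnz,
                        const double *x, double *y)
{
    assert(m >= 0 && nnz >= 0);
    for (int i = 0; i < m; ++i) {
        double sum = alpha * x[i];   /* scaled right-hand side for row i */
        double diag = 0.0;
        for (int p = 0; p < nnz; ++p) {
            if (rowind[p] != i)
                continue;
            if (colind[p] == i)
                diag = val[p];
            else if (colind[p] < i)
                sum -= val[p] * y[colind[p]];  /* already-solved unknowns */
        }
        if (diag == 0.0)
            return -1;
        y[i] = sum / diag;
    }
    return 0;
}
```

For the lower triangular matrix with rows (2 0 0), (1 4 0), (0 3 5), stored in shuffled order as val = {5, 1, 2, 3, 4}, rowind = {2, 1, 0, 2, 1}, colind = {2, 0, 0, 1, 1} with nnz = 5, a call with alpha = 2 and x = {2, 9, 21} returns 0 and leaves y = {2, 4, 6}.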
Interfaces FORTRAN 77: SUBROUTINE mkl_scoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, nnz INTEGER rowind(*), colind(*) REAL alpha REAL val(*) REAL x(*), y(*) SUBROUTINE mkl_dcoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION alpha DOUBLE PRECISION val(*) DOUBLE PRECISION x(*), y(*) SUBROUTINE mkl_ccoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, nnz INTEGER rowind(*), colind(*) COMPLEX alpha COMPLEX val(*) COMPLEX x(*), y(*) SUBROUTINE mkl_zcoosv(transa, m, alpha, matdescra, val, rowind, colind, nnz, x, y) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, nnz INTEGER rowind(*), colind(*) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*) DOUBLE COMPLEX x(*), y(*) BLAS and Sparse BLAS Routines 2 241 C: void mkl_scoosv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *rowind, int *colind, int *nnz, float *x, float *y); void mkl_dcoosv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *rowind, int *colind, int *nnz, double *x, double *y); void mkl_ccoosv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *x, MKL_Complex8 *y); void mkl_zcoosv(char *transa, int *m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *x, MKL_Complex16 *y); mkl_?csrmm Computes matrix - matrix product of a sparse matrix stored in the CSR format. 
Syntax Fortran: call mkl_scsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) call mkl_dcsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) call mkl_ccsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) call mkl_zcsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) C: mkl_scsrmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc); mkl_dcsrmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc); mkl_ccsrmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc); mkl_zcsrmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrmm routine performs a matrix-matrix operation defined as C := alpha*A*B + beta*C 2 Intel® Math Kernel Library Reference Manual 242 or C := alpha*A'*B + beta*C, where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in compressed sparse row (CSR) format, A' is the transpose of A. NOTE This routine supports a CSR format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then C := alpha*A*B + beta*C If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C, m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix C. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm. 
Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSR Format for more details.
indx INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to columns array description in CSR Format for more details.
pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(I) - pntrb(1)+1 is the first index of row I in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntrb(I) - pntrb(0) is the first index of row I in the arrays val and indx. Refer to pointerB array description in CSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(I) - pntrb(1) is the last index of row I in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntre(I) - pntrb(0)-1 is the last index of row I in the arrays val and indx. Refer to pointerE array description in CSR Format for more details.
b REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm.
Array, DIMENSION (ldb, at least n for non-transposed matrix A and at least m for transposed) for one-based indexing, and (at least k for non-transposed matrix A and at least m for transposed, ldb) for zero-based indexing. On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
beta REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm. Specifies the scalar beta.
c REAL for mkl_scsrmm. DOUBLE PRECISION for mkl_dcsrmm. COMPLEX for mkl_ccsrmm. DOUBLE COMPLEX for mkl_zcsrmm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. On entry with transa = 'N' or 'n', the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.
Output Parameters
c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C).
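As an illustration of how val, indx, pntrb and pntre are traversed, the following plain C sketch computes C := alpha*A*B + beta*C for the non-transposed case. It is a reference loop, not Intel MKL code. The name csrmm_ref is hypothetical, and the sketch assumes one-based indices, double-precision data, and dense matrices stored column by column with leading dimensions ldb and ldc.

```c
#include <assert.h>

/* Hypothetical reference helper, NOT an Intel MKL routine.
 * C := alpha*A*B + beta*C with A (m-by-k) in one-based CSR format,
 * B (k-by-n) and C (m-by-n) dense, column-major.
 * Row i of A occupies positions pntrb[i]-pntrb[0] .. pntre[i]-pntrb[0]-1
 * (counted from zero) in val and indx. */
void csrmm_ref(int m, int n, int k, double alpha,
               const double *val, const int *indx,
               const int *pntrb, const int *pntre,
               const double *b, int ldb, double beta,
               double *c, int ldc)
{
    assert(m > 0 && n >= 0 && k > 0 && ldb >= k && ldc >= m);
    const int base = pntrb[0];
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = pntrb[i] - base; p < pntre[i] - base; ++p) {
                int col = indx[p] - 1;        /* one-based -> zero-based */
                assert(col >= 0 && col < k);
                acc += val[p] * b[col + j * ldb];
            }
            double *cij = &c[i + j * ldc];
            /* Do not read C when beta is zero. */
            *cij = alpha * acc + (beta == 0.0 ? 0.0 : beta * *cij);
        }
    }
}
```

With A having rows (1, 0, 2) and (0, 3, 0), the arrays are val = {1, 2, 3}, indx = {1, 3, 2}, pntrb = {1, 3} and pntre = {3, 4}.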
Interfaces
FORTRAN 77:
SUBROUTINE mkl_scsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
REAL alpha, beta
REAL val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_dcsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE PRECISION alpha, beta
DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_ccsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
COMPLEX alpha, beta
COMPLEX val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_zcsrmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE COMPLEX alpha, beta
DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)
C:
void mkl_scsrmm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *beta, float *c, int *ldc);
void mkl_dcsrmm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *beta, double *c, int *ldc);
void mkl_ccsrmm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc);
void mkl_zcsrmm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc);
mkl_?bsrmm Computes matrix - matrix product of
a sparse matrix stored in the BSR format.
Syntax
Fortran:
call mkl_sbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_dbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_cbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_zbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
C:
mkl_sbsrmm(&transa, &m, &n, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_dbsrmm(&transa, &m, &n, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_cbsrmm(&transa, &m, &n, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_zbsrmm(&transa, &m, &n, &k, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_?bsrmm routine performs a matrix-matrix operation defined as
C := alpha*A*B + beta*C
or
C := alpha*A'*B + beta*C,
where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in block sparse row (BSR) format, A' is the transpose of A.
NOTE This routine supports a BSR format both with one-based indexing and zero-based indexing.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then the matrix-matrix product is computed as C := alpha*A*B + beta*C If transa = 'T' or 't' or 'C' or 'c', then the matrix-matrix product is computed as C := alpha*A'*B + beta*C,
m INTEGER. Number of block rows of the matrix A.
n INTEGER. Number of columns of the matrix C.
k INTEGER. Number of block columns of the matrix A. lb INTEGER. Size of the block in the matrix A. alpha REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only first four array elements are used, their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to the values array description in BSR Format for more details. indx INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks in the matrix A. Refer to the columns array description in BSR Format for more details. pntrb INTEGER. Array of length m. For one-based indexing: this array contains row indices, such that pntrb(I) - pntrb(1)+1 is the first index of block row I in the array indx. For zero-based indexing: this array contains row indices, such that pntrb(I) - pntrb(0) is the first index of block row I in the array indx. BLAS and Sparse BLAS Routines 2 247 Refer to pointerB array description in BSR Format for more details. pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(I) - pntrb(1) is the last index of block row I in the array indx. For zero-based indexing this array contains row indices, such that pntre(I) - pntrb(0)-1 is the last index of block row I in the array indx. Refer to pointerE array description in BSR Format for more details. 
b REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Array, DIMENSION (ldb, at least n for non-transposed matrix A and at least m for transposed) for one-based indexing, and (at least k for nontransposed matrix A and at least m for transposed, ldb) for zero-based indexing. On entry with transa= 'N' or 'n', the leading n-by-k block part of the array b must contain the matrix B, otherwise the leading m-by-n block part of the array b must contain the matrix B. ldb INTEGER. Specifies the leading dimension (in blocks) of b as declared in the calling (sub)program. beta REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Specifies the scalar beta. c REAL for mkl_sbsrmm. DOUBLE PRECISION for mkl_dbsrmm. COMPLEX for mkl_cbsrmm. DOUBLE COMPLEX for mkl_zbsrmm. Array, DIMENSION (ldc, n) for one-based indexing, DIMENSION (k, ldc) for zero-based indexing. On entry, the leading m-by-n block part of the array c must contain the matrix C, otherwise the leading n-by-k block part of the array c must contain the matrix C. ldc INTEGER. Specifies the leading dimension (in blocks) of c as declared in the calling (sub)program. Output Parameters c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C). 
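The block-row traversal can be sketched in plain C for the non-transposed case. This is an illustrative reference loop, not Intel MKL code, and the name bsrmm_ref is hypothetical. It rests on several assumptions that are not stated in the text above: indices are one-based, each lb-by-lb block is stored column by column inside val, the dense matrices are column-major, and the leading dimensions ldb_elems and ldc_elems are counted in elements rather than in blocks.

```c
#include <assert.h>

/* Hypothetical reference helper, NOT an Intel MKL routine.
 * C := alpha*A*B + beta*C, where A has m block rows and k block columns
 * of lb-by-lb blocks (one-based BSR), B is (k*lb)-by-n and C is
 * (m*lb)-by-n, both dense and column-major.
 * ASSUMPTIONS: blocks are stored column-major in val; ldb_elems and
 * ldc_elems are leading dimensions in elements, not in blocks. */
void bsrmm_ref(int m, int n, int k, int lb, double alpha,
               const double *val, const int *indx,
               const int *pntrb, const int *pntre,
               const double *b, int ldb_elems, double beta,
               double *c, int ldc_elems)
{
    assert(m > 0 && k > 0 && lb > 0 && n >= 0);
    assert(ldb_elems >= k * lb && ldc_elems >= m * lb);
    const int base = pntrb[0];

    /* Scale C by beta first; do not read C when beta is zero. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m * lb; ++i) {
            double *cij = &c[i + j * ldc_elems];
            *cij = (beta == 0.0) ? 0.0 : beta * *cij;
        }

    /* Accumulate alpha * (block) * (matching rows of B). */
    for (int bi = 0; bi < m; ++bi)
        for (int p = pntrb[bi] - base; p < pntre[bi] - base; ++p) {
            int bj = indx[p] - 1;             /* zero-based block column */
            assert(bj >= 0 && bj < k);
            const double *blk = val + p * lb * lb;
            for (int j = 0; j < n; ++j)
                for (int cc = 0; cc < lb; ++cc)
                    for (int r = 0; r < lb; ++r)
                        c[bi * lb + r + j * ldc_elems] +=
                            alpha * blk[r + cc * lb] *
                            b[bj * lb + cc + j * ldb_elems];
        }
}
```

A single non-zero block with rows (1, 2) and (3, 4) placed in block column 2 is described by val = {1, 3, 2, 4}, indx = {2}, pntrb = {1} and pntre = {2}.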
Interfaces
FORTRAN 77:
SUBROUTINE mkl_sbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
REAL alpha, beta
REAL val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_dbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE PRECISION alpha, beta
DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_cbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
COMPLEX alpha, beta
COMPLEX val(*), b(ldb,*), c(ldc,*)
SUBROUTINE mkl_zbsrmm(transa, m, n, k, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, k, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE COMPLEX alpha, beta
DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)
C:
void mkl_sbsrmm(char *transa, int *m, int *n, int *k, int *lb, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *beta, float *c, int *ldc);
void mkl_dbsrmm(char *transa, int *m, int *n, int *k, int *lb, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *beta, double *c, int *ldc);
void mkl_cbsrmm(char *transa, int *m, int *n, int *k, int *lb, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc);
void mkl_zbsrmm(char *transa, int *m, int *n, int *k, int *lb, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc);
mkl_?cscmm Computes matrix-matrix product of a sparse matrix stored in the CSC format.
Syntax
Fortran:
call mkl_scscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_dcscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_ccscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
call mkl_zcscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc)
C:
mkl_scscmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_dcscmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_ccscmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
mkl_zcscmm(&transa, &m, &n, &k, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, &beta, c, &ldc);
Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h
Description
The mkl_?cscmm routine performs a matrix-matrix operation defined as
C := alpha*A*B + beta*C
or
C := alpha*A'*B + beta*C,
where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in compressed sparse column (CSC) format, A' is the transpose of A.
NOTE This routine supports CSC format both with one-based indexing and zero-based indexing.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then C := alpha*A*B + beta*C If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C,
m INTEGER. Number of rows of the matrix A.
n INTEGER.
Number of columns of the matrix C.
k INTEGER. Number of columns of the matrix A.
alpha REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(k) - pntrb(1). For zero-based indexing its length is pntre(k-1) - pntrb(0). Refer to values array description in CSC Format for more details.
indx INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to rows array description in CSC Format for more details.
pntrb INTEGER. Array of length k. For one-based indexing this array contains column indices, such that pntrb(i) - pntrb(1)+1 is the first index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntrb(i) - pntrb(0) is the first index of column i in the arrays val and indx. Refer to pointerB array description in CSC Format for more details.
pntre INTEGER. Array of length k. For one-based indexing this array contains column indices, such that pntre(i) - pntrb(1) is the last index of column i in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntre(i) - pntrb(0)-1 is the last index of column i in the arrays val and indx. Refer to pointerE array description in CSC Format for more details.
b REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Array, DIMENSION (ldb, at least n for non-transposed matrix A and at least m for transposed) for one-based indexing, and (at least k for non-transposed matrix A and at least m for transposed, ldb) for zero-based indexing. On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
beta REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Specifies the scalar beta.
c REAL for mkl_scscmm. DOUBLE PRECISION for mkl_dcscmm. COMPLEX for mkl_ccscmm. DOUBLE COMPLEX for mkl_zcscmm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. On entry with transa = 'N' or 'n', the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.
Output Parameters
c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C).
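For comparison with the row-oriented formats, here is a plain C sketch of the same product driven by columns of A. It is a reference loop only and does not reflect how Intel MKL implements the routine. The name cscmm_ref is hypothetical, and the sketch assumes transa = 'N', one-based indices, double-precision data, and column-major dense matrices.

```c
#include <assert.h>

/* Hypothetical reference helper, NOT an Intel MKL routine.
 * C := alpha*A*B + beta*C with A (m-by-k) in one-based CSC format,
 * B (k-by-n) and C (m-by-n) dense, column-major.
 * Column q of A occupies positions pntrb[q]-pntrb[0] ..
 * pntre[q]-pntrb[0]-1 (counted from zero) in val and indx,
 * and indx holds one-based row indices. */
void cscmm_ref(int m, int n, int k, double alpha,
               const double *val, const int *indx,
               const int *pntrb, const int *pntre,
               const double *b, int ldb, double beta,
               double *c, int ldc)
{
    assert(m > 0 && n >= 0 && k > 0 && ldb >= k && ldc >= m);
    const int base = pntrb[0];
    for (int j = 0; j < n; ++j) {
        /* Scale column j of C; do not read C when beta is zero. */
        for (int i = 0; i < m; ++i) {
            double *cij = &c[i + j * ldc];
            *cij = (beta == 0.0) ? 0.0 : beta * *cij;
        }
        /* Each column q of A scatters into column j of C. */
        for (int q = 0; q < k; ++q) {
            double s = alpha * b[q + j * ldb];
            for (int p = pntrb[q] - base; p < pntre[q] - base; ++p) {
                int row = indx[p] - 1;        /* one-based -> zero-based */
                assert(row >= 0 && row < m);
                c[row + j * ldc] += val[p] * s;
            }
        }
    }
}
```

The matrix with rows (1, 0, 2) and (0, 3, 0) is stored as val = {1, 3, 2}, indx = {1, 2, 1}, pntrb = {1, 2, 3} and pntre = {2, 3, 4}; note that pntrb and pntre have k = 3 entries, one per column.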
Interfaces FORTRAN 77: SUBROUTINE mkl_scscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER indx(*), pntrb(k), pntre(k) REAL alpha, beta REAL val(*), b(ldb,*), c(ldc,*) 2 Intel® Math Kernel Library Reference Manual 252 SUBROUTINE mkl_dcscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER indx(*), pntrb(k), pntre(k) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_ccscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER indx(*), pntrb(k), pntre(k) COMPLEX alpha, beta COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zcscmm(transa, m, n, k, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER indx(*), pntrb(k), pntre(k) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_scscmm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *beta, float *c, int *ldc); void mkl_dcscmm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *beta, double *c, int *ldc); void mkl_ccscmm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc); BLAS and Sparse BLAS Routines 2 253 void mkl_zcscmm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 
*c, int *ldc); mkl_?coomm Computes matrix-matrix product of a sparse matrix stored in the coordinate format. Syntax Fortran: call mkl_scoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) call mkl_dcoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) call mkl_ccoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) call mkl_zcoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) C: mkl_scoomm(&transa, &m, &n, &k, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, &beta, c, &ldc); mkl_dcoomm(&transa, &m, &n, &k, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, &beta, c, &ldc); mkl_ccoomm(&transa, &m, &n, &k, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, &beta, c, &ldc); mkl_zcoomm(&transa, &m, &n, &k, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, &beta, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?coomm routine performs a matrix-matrix operation defined as C := alpha*A*B + beta*C or C := alpha*A'*B + beta*C, where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in the coordinate format, A' is the transpose of A. 2 Intel® Math Kernel Library Reference Manual 254 NOTE This routine supports a coordinate format both with one-based indexing and zero-based indexing. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then C := alpha*A*B + beta*C If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C, m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix C. k INTEGER. Number of columns of the matrix A. 
alpha REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only first four array elements are used, their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”. val REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Array of length nnz, contains non-zero elements of the matrix A in the arbitrary order. Refer to values array description in Coordinate Format for more details. rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind INTEGER. Array of length nnz, contains the column indices for each nonzero element of the matrix A. Refer to columns array description in Coordinate Format for more details. nnz INTEGER. Specifies the number of non-zero element of the matrix A. Refer to nnz description in Coordinate Format for more details. b REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Array, DIMENSION (ldb, at least n for non-transposed matrix A and at least m for transposed) for one-based indexing, and (at least k for nontransposed matrix A and at least m for transposed, ldb) for zero-based indexing. BLAS and Sparse BLAS Routines 2 255 On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B. ldb INTEGER. 
Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program. beta REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Specifies the scalar beta. c REAL for mkl_scoomm. DOUBLE PRECISION for mkl_dcoomm. COMPLEX for mkl_ccoomm. DOUBLE COMPLEX for mkl_zcoomm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zerobased indexing. On entry, the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C. ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program. Output Parameters c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C). Interfaces FORTRAN 77: SUBROUTINE mkl_scoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, nnz INTEGER rowind(*), colind(*) REAL alpha, beta REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dcoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, nnz INTEGER rowind(*), colind(*) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) 2 Intel® Math Kernel Library Reference Manual 256 SUBROUTINE mkl_ccoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, nnz INTEGER rowind(*), colind(*) COMPLEX alpha, beta COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zcoomm(transa, m, n, k, alpha, matdescra, val, rowind, colind, nnz, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, nnz INTEGER rowind(*), colind(*) DOUBLE 
COMPLEX alpha, beta DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_scoomm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *rowind, int *colind, int *nnz, float *b, int *ldb, float *beta, float *c, int *ldc); void mkl_dcoomm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *rowind, int *colind, int *nnz, double *b, int *ldb, double *beta, double *c, int *ldc); void mkl_ccoomm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc); void mkl_zcoomm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc); mkl_?csrsm Solves a system of linear matrix equations for a sparse matrix in the CSR format. Syntax Fortran: call mkl_scsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) call mkl_dcsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) BLAS and Sparse BLAS Routines 2 257 call mkl_ccsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) call mkl_zcsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc) C: mkl_scsrsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); mkl_dcsrsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); mkl_ccsrsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); mkl_zcsrsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?csrsm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the CSR format: C := alpha*inv(A)*B or C := 
alpha*inv(A')*B,
where: alpha is scalar, B and C are dense matrices, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A.
NOTE This routine supports a CSR format both with one-based indexing and zero-based indexing.
Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B,
m INTEGER. Number of columns of the matrix A.
n INTEGER. Number of columns of the matrix C.
alpha REAL for mkl_scsrsm. DOUBLE PRECISION for mkl_dcsrsm. COMPLEX for mkl_ccsrsm. DOUBLE COMPLEX for mkl_zcsrsm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scsrsm. DOUBLE PRECISION for mkl_dcsrsm. COMPLEX for mkl_ccsrsm. DOUBLE COMPLEX for mkl_zcsrsm. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSR Format for more details.
NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
indx INTEGER.
Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to columns array description in CSR Format for more details.
NOTE Column indices must be sorted in increasing order for each row.
pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of row i in the arrays val and indx. Refer to pointerB array description in CSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of row i in the arrays val and indx. For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of row i in the arrays val and indx. Refer to pointerE array description in CSR Format for more details.
b REAL for mkl_scsrsm. DOUBLE PRECISION for mkl_dcsrsm. COMPLEX for mkl_ccsrsm. DOUBLE COMPLEX for mkl_zcsrsm. Array, DIMENSION (ldb, n) for one-based indexing, and (m, ldb) for zero-based indexing. On entry the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.

Output Parameters
c REAL for mkl_scsrsm. DOUBLE PRECISION for mkl_dcsrsm. COMPLEX for mkl_ccsrsm. DOUBLE COMPLEX for mkl_zcsrsm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. The leading m-by-n part of the array c contains the output matrix C.
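The forward substitution behind C := alpha*inv(A)*B can be sketched in plain C. The function below is a reference illustration only, not the MKL implementation. It assumes zero-based indexing, transa = 'N', a lower triangular A whose non-unit diagonal is stored, and row-major b and c with second dimensions ldb and ldc. The name csr_lower_solve_ref and its return codes are hypothetical.

```c
#include <assert.h>

/* Reference sketch (not MKL code): C := alpha * inv(A) * B for a lower
 * triangular, non-unit-diagonal A stored in zero-based CSR.
 * Row i of A occupies positions pntrb[i]-pntrb[0] .. pntre[i]-pntrb[0]-1
 * of val and indx. b and c are row-major, m rows by n columns.
 * Returns 0 on success, -1 if a diagonal entry is missing or zero. */
int csr_lower_solve_ref(int m, int n, double alpha,
                        const double *val, const int *indx,
                        const int *pntrb, const int *pntre,
                        const double *b, int ldb,
                        double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && ldb >= n && ldc >= n);
    if (m == 0) return 0;
    const int base = pntrb[0];
    for (int i = 0; i < m; ++i) {
        const int first = pntrb[i] - base;
        const int end = pntre[i] - base;      /* one past the last entry */
        double diag = 0.0;
        int have_diag = 0;
        for (int p = first; p < end; ++p) {   /* locate A(i,i) */
            if (indx[p] == i) { diag = val[p]; have_diag = 1; }
        }
        if (!have_diag || diag == 0.0) return -1;
        for (int j = 0; j < n; ++j) {
            double s = alpha * b[i * ldb + j];
            for (int p = first; p < end; ++p) {
                const int col = indx[p];
                if (col < i) s -= val[p] * c[col * ldc + j];
            }
            c[i * ldc + j] = s / diag;        /* row i of the solution */
        }
    }
    return 0;
}
```

Because rows are processed in increasing order, every c entry read in the inner loop has already been computed, which is why the diagonal element must be present when the non-unit indicator is used.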
Interfaces
FORTRAN 77:
SUBROUTINE mkl_scsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
REAL alpha
REAL val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_dcsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE PRECISION alpha
DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_ccsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
COMPLEX alpha
COMPLEX val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_zcsrsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE COMPLEX alpha
DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)

C:
void mkl_scsrsm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *c, int *ldc);
void mkl_dcsrsm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *c, int *ldc);
void mkl_ccsrsm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc);
void mkl_zcsrsm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc);

mkl_?cscsm
Solves a system of linear matrix equations for a sparse matrix in the CSC format.
Syntax
Fortran:
call mkl_scscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_dcscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_ccscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_zcscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
C:
mkl_scscsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_dcscsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_ccscsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_zcscsm(&transa, &m, &n, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?cscsm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the CSC format: C := alpha*inv(A)*B or C := alpha*inv(A')*B,
where:
alpha is a scalar,
B and C are dense matrices,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal,
A' is the transpose of A.

NOTE This routine supports a CSC format both with one-based indexing and zero-based indexing.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the system of equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B. If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B.
m INTEGER. Number of columns of the matrix A.
n INTEGER. Number of columns of the matrix C.
alpha REAL for mkl_scscsm. DOUBLE PRECISION for mkl_dcscsm. COMPLEX for mkl_ccscsm. DOUBLE COMPLEX for mkl_zcscsm. Specifies the scalar alpha.
matdescra CHARACTER.
Array of six elements that specifies properties of the matrix used for the operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scscsm. DOUBLE PRECISION for mkl_dcscsm. COMPLEX for mkl_ccscsm. DOUBLE COMPLEX for mkl_zcscsm. Array containing non-zero elements of the matrix A. For one-based indexing its length is pntre(m) - pntrb(1). For zero-based indexing its length is pntre(m-1) - pntrb(0). Refer to values array description in CSC Format for more details.
NOTE The non-zero elements of the given column of the matrix must be stored in the same order as they appear in the column (from top to bottom). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
indx INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the val array. Refer to rows array description in CSC Format for more details.
NOTE Row indices must be sorted in increasing order for each column.
pntrb INTEGER. Array of length m. For one-based indexing this array contains column indices, such that pntrb(I) - pntrb(1)+1 is the first index of column I in the arrays val and indx. For zero-based indexing this array contains column indices, such that pntrb(I) - pntrb(0) is the first index of column I in the arrays val and indx. Refer to pointerB array description in CSC Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains column indices, such that pntre(I) - pntrb(1) is the last index of column I in the arrays val and indx.
For zero-based indexing this array contains column indices, such that pntre(I) - pntrb(0)-1 is the last index of column I in the arrays val and indx. Refer to pointerE array description in CSC Format for more details.
b REAL for mkl_scscsm. DOUBLE PRECISION for mkl_dcscsm. COMPLEX for mkl_ccscsm. DOUBLE COMPLEX for mkl_zcscsm. Array, DIMENSION (ldb, n) for one-based indexing, and (m, ldb) for zero-based indexing. On entry the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.

Output Parameters
c REAL for mkl_scscsm. DOUBLE PRECISION for mkl_dcscsm. COMPLEX for mkl_ccscsm. DOUBLE COMPLEX for mkl_zcscsm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. The leading m-by-n part of the array c contains the output matrix C.
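With column storage, the same solve is naturally written as a column-oriented substitution. The sketch below is a plain C reference illustration, not the MKL implementation. It assumes zero-based indexing, transa = 'N', a lower triangular A with a stored non-unit diagonal, sorted row indices (so the diagonal is the first entry of each column), and row-major b and c. The name csc_lower_solve_ref and its return codes are hypothetical.

```c
#include <assert.h>

/* Reference sketch (not MKL code): C := alpha * inv(A) * B for a lower
 * triangular, non-unit-diagonal A stored in zero-based CSC.
 * Column k of A occupies positions pntrb[k]-pntrb[0] .. pntre[k]-pntrb[0]-1
 * of val and indx, with row indices sorted in increasing order.
 * Returns 0 on success, -1 if a diagonal entry is missing or zero. */
int csc_lower_solve_ref(int m, int n, double alpha,
                        const double *val, const int *indx,
                        const int *pntrb, const int *pntre,
                        const double *b, int ldb,
                        double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && ldb >= n && ldc >= n);
    if (m == 0) return 0;
    const int base = pntrb[0];

    /* Start from C = alpha * B and eliminate one column of A at a time. */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            c[i * ldc + j] = alpha * b[i * ldb + j];

    for (int k = 0; k < m; ++k) {
        const int first = pntrb[k] - base;
        const int end = pntre[k] - base;      /* one past the last entry */
        if (first >= end || indx[first] != k || val[first] == 0.0)
            return -1;                        /* diagonal must come first */
        for (int j = 0; j < n; ++j)
            c[k * ldc + j] /= val[first];     /* row k of the solution */
        for (int p = first + 1; p < end; ++p) {
            const int r = indx[p];            /* r > k: below the diagonal */
            for (int j = 0; j < n; ++j)
                c[r * ldc + j] -= val[p] * c[k * ldc + j];
        }
    }
    return 0;
}
```

The early return on a missing diagonal mirrors the requirement that no diagonal element be omitted when the non-unit indicator is used.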
Interfaces
FORTRAN 77:
SUBROUTINE mkl_scscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
REAL alpha
REAL val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_dcscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE PRECISION alpha
DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_ccscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
COMPLEX alpha
COMPLEX val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_zcscsm(transa, m, n, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE COMPLEX alpha
DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)

C:
void mkl_scscsm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *c, int *ldc);
void mkl_dcscsm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *c, int *ldc);
void mkl_ccscsm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc);
void mkl_zcscsm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc);

mkl_?coosm
Solves a system of linear matrix equations for a sparse matrix in the coordinate format.
Syntax
Fortran:
call mkl_scoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
call mkl_dcoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
call mkl_ccoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
call mkl_zcoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
C:
mkl_scoosm(&transa, &m, &n, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, c, &ldc);
mkl_dcoosm(&transa, &m, &n, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, c, &ldc);
mkl_ccoosm(&transa, &m, &n, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, c, &ldc);
mkl_zcoosm(&transa, &m, &n, &alpha, matdescra, val, rowind, colind, &nnz, b, &ldb, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?coosm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the coordinate format: C := alpha*inv(A)*B or C := alpha*inv(A')*B,
where:
alpha is a scalar,
B and C are dense matrices,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal,
A' is the transpose of A.

NOTE This routine supports a coordinate format both with one-based indexing and zero-based indexing.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B. If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B.
m INTEGER. Number of rows of the matrix A.
n INTEGER. Number of columns of the matrix C.
alpha REAL for mkl_scoosm. DOUBLE PRECISION for mkl_dcoosm. COMPLEX for mkl_ccoosm.
DOUBLE COMPLEX for mkl_zcoosm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements that specifies properties of the matrix used for the operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_scoosm. DOUBLE PRECISION for mkl_dcoosm. COMPLEX for mkl_ccoosm. DOUBLE COMPLEX for mkl_zcoosm. Array of length nnz, contains non-zero elements of the matrix A in arbitrary order. Refer to values array description in Coordinate Format for more details.
rowind INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details.
colind INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details.
nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details.
b REAL for mkl_scoosm. DOUBLE PRECISION for mkl_dcoosm. COMPLEX for mkl_ccoosm. DOUBLE COMPLEX for mkl_zcoosm. Array, DIMENSION (ldb, n) for one-based indexing, and (m, ldb) for zero-based indexing. On entry the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b for one-based indexing, and the second dimension of b for zero-based indexing, as declared in the calling (sub)program.
ldc INTEGER. Specifies the leading dimension of c for one-based indexing, and the second dimension of c for zero-based indexing, as declared in the calling (sub)program.

Output Parameters
c REAL for mkl_scoosm. DOUBLE PRECISION for mkl_dcoosm. COMPLEX for mkl_ccoosm.
DOUBLE COMPLEX for mkl_zcoosm. Array, DIMENSION (ldc, n) for one-based indexing, and (m, ldc) for zero-based indexing. The leading m-by-n part of the array c contains the output matrix C.

Interfaces
FORTRAN 77:
SUBROUTINE mkl_scoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc, nnz
INTEGER rowind(*), colind(*)
REAL alpha
REAL val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_dcoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc, nnz
INTEGER rowind(*), colind(*)
DOUBLE PRECISION alpha
DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_ccoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc, nnz
INTEGER rowind(*), colind(*)
COMPLEX alpha
COMPLEX val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_zcoosm(transa, m, n, alpha, matdescra, val, rowind, colind, nnz, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, ldb, ldc, nnz
INTEGER rowind(*), colind(*)
DOUBLE COMPLEX alpha
DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)

C:
void mkl_scoosm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *rowind, int *colind, int *nnz, float *b, int *ldb, float *c, int *ldc);
void mkl_dcoosm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *rowind, int *colind, int *nnz, double *b, int *ldb, double *c, int *ldc);
void mkl_ccoosm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *rowind, int *colind, int *nnz, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc);
void mkl_zcoosm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *rowind, int *colind, int *nnz, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc);

mkl_?bsrsm
Solves a
system of linear matrix equations for a sparse matrix in the BSR format.

Syntax
Fortran:
call mkl_sbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_dbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_cbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
call mkl_zbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
C:
mkl_sbsrsm(&transa, &m, &n, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_dbsrsm(&transa, &m, &n, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_cbsrsm(&transa, &m, &n, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);
mkl_zbsrsm(&transa, &m, &n, &lb, &alpha, matdescra, val, indx, pntrb, pntre, b, &ldb, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?bsrsm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the BSR format: C := alpha*inv(A)*B or C := alpha*inv(A')*B,
where:
alpha is a scalar,
B and C are dense matrices,
A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal,
A' is the transpose of A.

NOTE This routine supports a BSR format both with one-based indexing and zero-based indexing.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B. If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B.
m INTEGER. Number of block columns of the matrix A.
n INTEGER. Number of columns of the matrix C.
lb INTEGER. Size of the block in the matrix A.
alpha REAL for mkl_sbsrsm. DOUBLE PRECISION for mkl_dbsrsm. COMPLEX for mkl_cbsrsm. DOUBLE COMPLEX for mkl_zbsrsm. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements that specifies properties of the matrix used for the operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sbsrsm. DOUBLE PRECISION for mkl_dbsrsm. COMPLEX for mkl_cbsrsm. DOUBLE COMPLEX for mkl_zbsrsm. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by lb*lb. Refer to the values array description in BSR Format for more details.
NOTE The non-zero elements of the given row of the matrix must be stored in the same order as they appear in the row (from left to right). No diagonal element can be omitted from a sparse storage if the solver is called with the non-unit indicator.
indx INTEGER. Array containing the column indices for each non-zero block in the matrix A. Its length is equal to the number of non-zero blocks in the matrix A. Refer to the columns array description in BSR Format for more details.
pntrb INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntrb(i) - pntrb(1)+1 is the first index of block row i in the array indx. For zero-based indexing this array contains row indices, such that pntrb(i) - pntrb(0) is the first index of block row i in the array indx. Refer to pointerB array description in BSR Format for more details.
pntre INTEGER. Array of length m. For one-based indexing this array contains row indices, such that pntre(i) - pntrb(1) is the last index of block row i in the array indx.
For zero-based indexing this array contains row indices, such that pntre(i) - pntrb(0)-1 is the last index of block row i in the array indx. Refer to pointerE array description in BSR Format for more details.
b REAL for mkl_sbsrsm. DOUBLE PRECISION for mkl_dbsrsm. COMPLEX for mkl_cbsrsm. DOUBLE COMPLEX for mkl_zbsrsm. Array, DIMENSION (ldb, n) for one-based indexing, DIMENSION (m, ldb) for zero-based indexing. On entry the leading m-by-n part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension (in blocks) of b as declared in the calling (sub)program.
ldc INTEGER. Specifies the leading dimension (in blocks) of c as declared in the calling (sub)program.

Output Parameters
c REAL for mkl_sbsrsm. DOUBLE PRECISION for mkl_dbsrsm. COMPLEX for mkl_cbsrsm. DOUBLE COMPLEX for mkl_zbsrsm. Array, DIMENSION (ldc, n) for one-based indexing, DIMENSION (m, ldc) for zero-based indexing. The leading m-by-n part of the array c contains the output matrix C.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_sbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
REAL alpha
REAL val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_dbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE PRECISION alpha
DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_cbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
COMPLEX alpha
COMPLEX val(*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_zbsrsm(transa, m, n, lb, alpha, matdescra, val, indx, pntrb, pntre, b, ldb, c, ldc)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, n, lb, ldb, ldc
INTEGER indx(*), pntrb(m), pntre(m)
DOUBLE COMPLEX alpha
DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*)

C:
void mkl_sbsrsm(char *transa, int *m, int *n, int *lb, float *alpha, char *matdescra, float *val, int *indx, int *pntrb, int *pntre, float *b, int *ldb, float *c, int *ldc);
void mkl_dbsrsm(char *transa, int *m, int *n, int *lb, double *alpha, char *matdescra, double *val, int *indx, int *pntrb, int *pntre, double *b, int *ldb, double *c, int *ldc);
void mkl_cbsrsm(char *transa, int *m, int *n, int *lb, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *indx, int *pntrb, int *pntre, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc);
void mkl_zbsrsm(char *transa, int *m, int *n, int *lb, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *indx, int *pntrb, int *pntre, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc);

mkl_?diamv
Computes a matrix-vector product for a sparse matrix in the diagonal format with one-based indexing.
Syntax
Fortran:
call mkl_sdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y)
call mkl_ddiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y)
call mkl_cdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y)
call mkl_zdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y)
C:
mkl_sdiamv(&transa, &m, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, x, &beta, y);
mkl_ddiamv(&transa, &m, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, x, &beta, y);
mkl_cdiamv(&transa, &m, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, x, &beta, y);
mkl_zdiamv(&transa, &m, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, x, &beta, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?diamv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y,
where:
alpha and beta are scalars,
x and y are vectors,
A is an m-by-k sparse matrix stored in the diagonal format,
A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y.
m INTEGER. Number of rows of the matrix A.
k INTEGER. Number of columns of the matrix A.
alpha REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation.
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
val REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.
lval INTEGER. Leading dimension of val, lval >= m. Refer to lval description in Diagonal Storage Scheme for more details.
idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A. Refer to distance array description in Diagonal Storage Scheme for more details.
ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.
x REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Array, DIMENSION at least k if transa = 'N' or 'n', and at least m otherwise. On entry, the array x must contain the vector x.
beta REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Specifies the scalar beta.
y REAL for mkl_sdiamv. DOUBLE PRECISION for mkl_ddiamv. COMPLEX for mkl_cdiamv. DOUBLE COMPLEX for mkl_zdiamv. Array, DIMENSION at least m if transa = 'N' or 'n', and at least k otherwise. On entry, the array y must contain the vector y.

Output Parameters
y Overwritten by the updated vector y.
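To make the roles of val, lval, idiag, and ndiag concrete, the following plain C sketch evaluates y := alpha*A*x + beta*y for a general matrix in diagonal storage. It is a reference illustration, not the MKL implementation. It assumes that val is laid out column by column (Fortran order) and that entry A(i, i+idiag(d)) sits in row i of column d of val, with unused positions padded. The C arrays here are indexed from zero. The name dia_mv_ref is hypothetical.

```c
#include <assert.h>

/* Reference sketch (not MKL code): y := alpha*A*x + beta*y for an
 * m-by-k matrix A in diagonal storage, transa = 'N'.
 * val is lval-by-ndiag in column-major order; column d holds the diagonal
 * at distance idiag[d] from the main diagonal, so that
 * A(i, i + idiag[d]) == val[d*lval + i] (zero-based i). */
void dia_mv_ref(int m, int k, double alpha,
                const double *val, int lval,
                const int *idiag, int ndiag,
                const double *x, double beta, double *y)
{
    assert(m >= 0 && k >= 0 && ndiag >= 0 && lval >= m);
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int d = 0; d < ndiag; ++d) {
            const int col = i + idiag[d];
            if (col >= 0 && col < k)          /* skip padded positions */
                s += val[d * lval + i] * x[col];
        }
        y[i] = alpha * s + beta * y[i];
    }
}
```

For a tridiagonal 3-by-3 matrix, idiag would be {-1, 0, 1}, and val would hold three columns of three entries each, with one padded entry at the top of the sub-diagonal column and one at the bottom of the super-diagonal column.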
Interfaces
FORTRAN 77:
SUBROUTINE mkl_sdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, k, lval, ndiag
INTEGER idiag(*)
REAL alpha, beta
REAL val(lval,*), x(*), y(*)

SUBROUTINE mkl_ddiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, k, lval, ndiag
INTEGER idiag(*)
DOUBLE PRECISION alpha, beta
DOUBLE PRECISION val(lval,*), x(*), y(*)

SUBROUTINE mkl_cdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, k, lval, ndiag
INTEGER idiag(*)
COMPLEX alpha, beta
COMPLEX val(lval,*), x(*), y(*)

SUBROUTINE mkl_zdiamv(transa, m, k, alpha, matdescra, val, lval, idiag, ndiag, x, beta, y)
CHARACTER*1 transa
CHARACTER matdescra(*)
INTEGER m, k, lval, ndiag
INTEGER idiag(*)
DOUBLE COMPLEX alpha, beta
DOUBLE COMPLEX val(lval,*), x(*), y(*)

C:
void mkl_sdiamv(char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *lval, int *idiag, int *ndiag, float *x, float *beta, float *y);
void mkl_ddiamv(char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *lval, int *idiag, int *ndiag, double *x, double *beta, double *y);
void mkl_cdiamv(char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y);
void mkl_zdiamv(char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y);

mkl_?skymv
Computes a matrix-vector product for a sparse matrix in the skyline storage format with one-based indexing.
Syntax
Fortran:
call mkl_sskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
call mkl_dskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
call mkl_cskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
call mkl_zskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
C:
mkl_sskymv(&transa, &m, &k, &alpha, matdescra, val, pntr, x, &beta, y);
mkl_dskymv(&transa, &m, &k, &alpha, matdescra, val, pntr, x, &beta, y);
mkl_cskymv(&transa, &m, &k, &alpha, matdescra, val, pntr, x, &beta, y);
mkl_zskymv(&transa, &m, &k, &alpha, matdescra, val, pntr, x, &beta, y);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?skymv routine performs a matrix-vector operation defined as y := alpha*A*x + beta*y or y := alpha*A'*x + beta*y,
where:
alpha and beta are scalars,
x and y are vectors,
A is an m-by-k sparse matrix stored using the skyline storage scheme,
A' is the transpose of A.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then y := alpha*A*x + beta*y. If transa = 'T' or 't' or 'C' or 'c', then y := alpha*A'*x + beta*y.
m INTEGER. Number of rows of the matrix A.
k INTEGER. Number of columns of the matrix A.
alpha REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Specifies the scalar alpha.
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”.
Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
NOTE General matrices (matdescra(1)='G') are not supported.
val REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Array containing the set of elements of the matrix A in the skyline profile form. If matdescra(2)= 'L', then val contains elements from the lower triangle of the matrix A. If matdescra(2)= 'U', then val contains elements from the upper triangle of the matrix A. Refer to values array description in Skyline Storage Scheme for more details.
pntr INTEGER. Array of length (m+1) for lower triangle, and (k+1) for upper triangle. It contains the indices specifying the positions in val of the first element in each row (column) of the matrix A. Refer to pointers array description in Skyline Storage Scheme for more details.
x REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Array, DIMENSION at least k if transa = 'N' or 'n' and at least m otherwise. On entry, the array x must contain the vector x.
beta REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Specifies the scalar beta.
y REAL for mkl_sskymv. DOUBLE PRECISION for mkl_dskymv. COMPLEX for mkl_cskymv. DOUBLE COMPLEX for mkl_zskymv. Array, DIMENSION at least m if transa = 'N' or 'n' and at least k otherwise. On entry, the array y must contain the vector y.

Output Parameters
y Overwritten by the updated vector y.
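The way val and pntr describe a lower-triangle skyline profile can be sketched in plain C as follows. This is a reference illustration, not the MKL implementation, and it rests on an assumed reading of the Skyline Storage Scheme: each row is stored contiguously from its first stored element up to and including the diagonal, zeros inside the profile are stored explicitly, and pntr holds m+1 one-based positions with the last entry marking one past the end of val. The matrix is treated as triangular, transa = 'N', and the name sky_lower_mv_ref is hypothetical.

```c
#include <assert.h>

/* Reference sketch (not MKL code): y := alpha*A*x + beta*y for an m-by-m
 * lower triangular A in skyline storage (assumed layout, see lead-in).
 * Row i (zero-based) holds pntr[i+1] - pntr[i] consecutive entries that
 * end at the diagonal A(i,i); pntr values are one-based as in the
 * routine description, while val, x and y are plain C arrays. */
void sky_lower_mv_ref(int m, double alpha,
                      const double *val, const int *pntr,
                      const double *x, double beta, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; ++i) {
        const int start = pntr[i] - pntr[0];   /* position in val */
        const int len = pntr[i + 1] - pntr[i]; /* entries in row i */
        assert(len >= 1 && len <= i + 1);      /* diagonal is always stored */
        const int first_col = i - len + 1;     /* column of first entry */
        double s = 0.0;
        for (int t = 0; t < len; ++t)
            s += val[start + t] * x[first_col + t];
        y[i] = alpha * s + beta * y[i];
    }
}
```

For the matrix with rows (1), (2 3), (4 0 5), this layout gives val = {1, 2, 3, 4, 0, 5} and pntr = {1, 2, 4, 7}. The zero in the third row is stored because it lies inside the profile.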
Interfaces

FORTRAN 77:

SUBROUTINE mkl_sskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER pntr(*)
  REAL alpha, beta
  REAL val(*), x(*), y(*)

SUBROUTINE mkl_dskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER pntr(*)
  DOUBLE PRECISION alpha, beta
  DOUBLE PRECISION val(*), x(*), y(*)

SUBROUTINE mkl_cskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER pntr(*)
  COMPLEX alpha, beta
  COMPLEX val(*), x(*), y(*)

SUBROUTINE mkl_zskymv(transa, m, k, alpha, matdescra, val, pntr, x, beta, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, k
  INTEGER pntr(*)
  DOUBLE COMPLEX alpha, beta
  DOUBLE COMPLEX val(*), x(*), y(*)

C:

void mkl_sskymv (char *transa, int *m, int *k, float *alpha, char *matdescra, float *val, int *pntr, float *x, float *beta, float *y);
void mkl_dskymv (char *transa, int *m, int *k, double *alpha, char *matdescra, double *val, int *pntr, double *x, double *beta, double *y);
void mkl_cskymv (char *transa, int *m, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *pntr, MKL_Complex8 *x, MKL_Complex8 *beta, MKL_Complex8 *y);
void mkl_zskymv (char *transa, int *m, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *pntr, MKL_Complex16 *x, MKL_Complex16 *beta, MKL_Complex16 *y);

mkl_?diasv
Solves a system of linear equations for a sparse matrix in the diagonal format with one-based indexing.
Syntax Fortran: call mkl_sdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y) call mkl_ddiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y) call mkl_cdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y) call mkl_zdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y) C: mkl_sdiasv(&transa, &m, &alpha, matdescra, val, &lval, idiag, &ndiag, x, y); mkl_ddiasv(&transa, &m, &alpha, matdescra, val, &lval, idiag, &ndiag, x, y); mkl_cdiasv(&transa, &m, &alpha, matdescra, val, &lval, idiag, &ndiag, x, y); 2 Intel® Math Kernel Library Reference Manual 278 mkl_zdiasv(&transa, &m, &alpha, matdescra, val, &lval, idiag, &ndiag, x, y); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?diasv routine solves a system of linear equations with matrix-vector operations for a sparse matrix stored in the diagonal format: y := alpha*inv(A)*x or y := alpha*inv(A')* x, where: alpha is scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then y := alpha*inv(A)*x If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')*x, m INTEGER. Number of rows of the matrix A. alpha REAL for mkl_sdiasv. DOUBLE PRECISION for mkl_ddiasv. COMPLEX for mkl_cdiasv. DOUBLE COMPLEX for mkl_zdiasv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. 
Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.

val REAL for mkl_sdiasv. DOUBLE PRECISION for mkl_ddiasv. COMPLEX for mkl_cdiasv. DOUBLE COMPLEX for mkl_zdiasv.
Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.

lval INTEGER. Leading dimension of val, lval=m. Refer to lval description in Diagonal Storage Scheme for more details.

idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A.
NOTE All elements of this array must be sorted in increasing order.
Refer to distance array description in Diagonal Storage Scheme for more details.

ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.

x REAL for mkl_sdiasv. DOUBLE PRECISION for mkl_ddiasv. COMPLEX for mkl_cdiasv. DOUBLE COMPLEX for mkl_zdiasv.
Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment.

y REAL for mkl_sdiasv. DOUBLE PRECISION for mkl_ddiasv. COMPLEX for mkl_cdiasv. DOUBLE COMPLEX for mkl_zdiasv.
Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment.

Output Parameters

y Contains the solution vector, alpha*inv(A)*x or alpha*inv(A')*x.
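The solve y := alpha*inv(A)*x amounts to a substitution sweep over the stored diagonals. The sketch below shows forward substitution for a lower triangular matrix with a non-unit main diagonal in double precision. It is an illustration only, not the Intel MKL implementation: the function name dia_lower_solve is hypothetical, and the layout is an assumption, namely that val is stored column by column with leading dimension lval and that val(i, j) holds the element of row i on the diagonal at distance idiag(j).

```c
#include <assert.h>

/* Hypothetical reference loop (not MKL code): solves A*y = alpha*x,
 * that is y := alpha*inv(A)*x, for a lower triangular A with a non-unit
 * main diagonal stored in diagonal format.
 * Assumed layout: val is column-major with leading dimension lval, and
 * val[i + j*lval] holds A(i, i + idiag[j]) using zero-based i.
 * The main diagonal (distance 0) must be stored and free of zeros. */
void dia_lower_solve(int m, double alpha, const double *val, int lval,
                     const int *idiag, int ndiag,
                     const double *x, double *y)
{
    assert(lval >= m);
    for (int i = 0; i < m; ++i) {
        double rhs = alpha * x[i];
        double diag = 0.0;
        for (int j = 0; j < ndiag; ++j) {
            int col = i + idiag[j];
            if (idiag[j] == 0)
                diag = val[i + j * lval];        /* element A(i, i)        */
            else if (idiag[j] < 0 && col >= 0)
                rhs -= val[i + j * lval] * y[col]; /* y[col] already known */
        }
        y[i] = rhs / diag;
    }
}
```

For the matrix with rows (2 0 0), (1 3 0), (0 4 5), the assumed layout gives idiag = {-1, 0} and val = {0, 1, 4, 2, 3, 5} with lval = 3, where the leading 0 is padding for the shorter sub-diagonal.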
Interfaces

FORTRAN 77:

SUBROUTINE mkl_sdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  REAL alpha
  REAL val(lval,*), x(*), y(*)

SUBROUTINE mkl_ddiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  DOUBLE PRECISION alpha
  DOUBLE PRECISION val(lval,*), x(*), y(*)

SUBROUTINE mkl_cdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  COMPLEX alpha
  COMPLEX val(lval,*), x(*), y(*)

SUBROUTINE mkl_zdiasv(transa, m, alpha, matdescra, val, lval, idiag, ndiag, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m, lval, ndiag
  INTEGER idiag(*)
  DOUBLE COMPLEX alpha
  DOUBLE COMPLEX val(lval,*), x(*), y(*)

C:

void mkl_sdiasv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *lval, int *idiag, int *ndiag, float *x, float *y);
void mkl_ddiasv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *lval, int *idiag, int *ndiag, double *x, double *y);
void mkl_cdiasv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zdiasv(char *transa, int *m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?skysv
Solves a system of linear equations for a sparse matrix in the skyline format with one-based indexing.
Syntax Fortran: call mkl_sskysv(transa, m, alpha, matdescra, val, pntr, x, y) call mkl_dskysv(transa, m, alpha, matdescra, val, pntr, x, y) call mkl_cskysv(transa, m, alpha, matdescra, val, pntr, x, y) call mkl_zskysv(transa, m, alpha, matdescra, val, pntr, x, y) C: mkl_sskysv(&transa, &m, &alpha, matdescra, val, pntr, x, y); mkl_dskysv(&transa, &m, &alpha, matdescra, val, pntr, x, y); mkl_cskysv(&transa, &m, &alpha, matdescra, val, pntr, x, y); mkl_zskysv(&transa, &m, &alpha, matdescra, val, pntr, x, y); BLAS and Sparse BLAS Routines 2 281 Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?skysv routine solves a system of linear equations with matrix-vector operations for a sparse matrix in the skyline storage format: y := alpha*inv(A)*x or y := alpha*inv(A')*x, where: alpha is scalar, x and y are vectors, A is a sparse upper or lower triangular matrix with unit or non-unit main diagonal, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then y := alpha*inv(A)*x If transa = 'T' or 't' or 'C' or 'c', then y := alpha*inv(A')* x, m INTEGER. Number of rows of the matrix A. alpha REAL for mkl_sskysv. DOUBLE PRECISION for mkl_dskysv. COMPLEX for mkl_cskysv. DOUBLE COMPLEX for mkl_zskysv. Specifies the scalar alpha. matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only first four array elements are used, their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. 
Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
NOTE General matrices (matdescra(1)='G') are not supported.

val REAL for mkl_sskysv. DOUBLE PRECISION for mkl_dskysv. COMPLEX for mkl_cskysv. DOUBLE COMPLEX for mkl_zskysv.
Array containing the set of elements of the matrix A in the skyline profile form. If matdescra(2)='L', then val contains elements from the lower triangle of the matrix A. If matdescra(2)='U', then val contains elements from the upper triangle of the matrix A. Refer to values array description in Skyline Storage Scheme for more details.

pntr INTEGER. Array of length (m+m) for lower triangle, and (k+k) for upper triangle. It contains the indices that specify the positions in val of the first element in each row (column) of the matrix A. Refer to pointers array description in Skyline Storage Scheme for more details.

x REAL for mkl_sskysv. DOUBLE PRECISION for mkl_dskysv. COMPLEX for mkl_cskysv. DOUBLE COMPLEX for mkl_zskysv.
Array, DIMENSION at least m. On entry, the array x must contain the vector x. The elements are accessed with unit increment.

y REAL for mkl_sskysv. DOUBLE PRECISION for mkl_dskysv. COMPLEX for mkl_cskysv. DOUBLE COMPLEX for mkl_zskysv.
Array, DIMENSION at least m. On entry, the array y must contain the vector y. The elements are accessed with unit increment.

Output Parameters

y Contains the solution vector, alpha*inv(A)*x or alpha*inv(A')*x.
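For a lower triangular matrix in skyline form, y := alpha*inv(A)*x reduces to forward substitution over each row profile. The following C sketch is illustrative only and is not the Intel MKL implementation. The name sky_lower_solve is hypothetical, only transa = 'N', a non-unit diagonal and double precision are covered, and the layout is an assumption: each row is stored from its first non-zero element through the diagonal, with pntr holding one-based start positions plus one closing entry.

```c
#include <assert.h>

/* Hypothetical reference loop (not MKL code): solves A*y = alpha*x,
 * that is y := alpha*inv(A)*x, for a lower triangular A in skyline form
 * with a non-unit, non-zero main diagonal.
 * Assumed layout: the profile of row i starts at val[pntr[i] - pntr[0]]
 * and ends with the diagonal element; pntr[m] closes the last row. */
void sky_lower_solve(int m, double alpha, const double *val,
                     const int *pntr, const double *x, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; ++i) {
        int start = pntr[i] - pntr[0];     /* zero-based offset into val */
        int len = pntr[i + 1] - pntr[i];   /* profile length of row i    */
        int first_col = i - len + 1;       /* column held by val[start]  */
        double rhs = alpha * x[i];
        /* Subtract contributions of the already computed unknowns. */
        for (int p = 0; p < len - 1; ++p)
            rhs -= val[start + p] * y[first_col + p];
        y[i] = rhs / val[start + len - 1]; /* divide by the diagonal     */
    }
}
```

A unit-diagonal variant would skip the final division, and an upper triangular matrix would be swept from the last row to the first.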
Interfaces

FORTRAN 77:

SUBROUTINE mkl_sskysv(transa, m, alpha, matdescra, val, pntr, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m
  INTEGER pntr(*)
  REAL alpha
  REAL val(*), x(*), y(*)

SUBROUTINE mkl_dskysv(transa, m, alpha, matdescra, val, pntr, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m
  INTEGER pntr(*)
  DOUBLE PRECISION alpha
  DOUBLE PRECISION val(*), x(*), y(*)

SUBROUTINE mkl_cskysv(transa, m, alpha, matdescra, val, pntr, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m
  INTEGER pntr(*)
  COMPLEX alpha
  COMPLEX val(*), x(*), y(*)

SUBROUTINE mkl_zskysv(transa, m, alpha, matdescra, val, pntr, x, y)
  CHARACTER*1 transa
  CHARACTER matdescra(*)
  INTEGER m
  INTEGER pntr(*)
  DOUBLE COMPLEX alpha
  DOUBLE COMPLEX val(*), x(*), y(*)

C:

void mkl_sskysv(char *transa, int *m, float *alpha, char *matdescra, float *val, int *pntr, float *x, float *y);
void mkl_dskysv(char *transa, int *m, double *alpha, char *matdescra, double *val, int *pntr, double *x, double *y);
void mkl_cskysv(char *transa, int *m, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *pntr, MKL_Complex8 *x, MKL_Complex8 *y);
void mkl_zskysv(char *transa, int *m, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *pntr, MKL_Complex16 *x, MKL_Complex16 *y);

mkl_?diamm
Computes matrix-matrix product of a sparse matrix stored in the diagonal format with one-based indexing.
Syntax Fortran: call mkl_sdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc) call mkl_ddiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc) call mkl_cdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc) call mkl_zdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc) 2 Intel® Math Kernel Library Reference Manual 284 C: mkl_sdiamm(&transa, &m, &n, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, &beta, c, &ldc); mkl_ddiamm(&transa, &m, &n, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, &beta, c, &ldc); mkl_cdiamm(&transa, &m, &n, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, &beta, c, &ldc); mkl_zdiamm(&transa, &m, &n, &k, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, &beta, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?diamm routine performs a matrix-matrix operation defined as C := alpha*A*B + beta*C or C := alpha*A'*B + beta*C, where: alpha and beta are scalars, B and C are dense matrices, A is an m-by-k sparse matrix in the diagonal format, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the operation. If transa = 'N' or 'n', then C := alpha*A*B + beta*C, If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C. m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix C. k INTEGER. Number of columns of the matrix A. alpha REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm. Specifies the scalar alpha. 
matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.

val REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm.
Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.

lval INTEGER. Leading dimension of val, lval=m. Refer to lval description in Diagonal Storage Scheme for more details.

idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A. Refer to distance array description in Diagonal Storage Scheme for more details.

ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.

b REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm.
Array, DIMENSION (ldb, n). On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B.

ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.

beta REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm.
Specifies the scalar beta.

c REAL for mkl_sdiamm. DOUBLE PRECISION for mkl_ddiamm. COMPLEX for mkl_cdiamm. DOUBLE COMPLEX for mkl_zdiamm.
Array, DIMENSION (ldc, n). On entry with transa = 'N' or 'n', the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C.

ldc INTEGER.
Specifies the leading dimension of c as declared in the calling (sub)program. Output Parameters c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C). 2 Intel® Math Kernel Library Reference Manual 286 Interfaces FORTRAN 77: SUBROUTINE mkl_sdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, lval, ndiag INTEGER idiag(*) REAL alpha, beta REAL val(lval,*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_ddiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, lval, ndiag INTEGER idiag(*) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(lval,*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_cdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, lval, ndiag INTEGER idiag(*) COMPLEX alpha, beta COMPLEX val(lval,*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zdiamm(transa, m, n, k, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc, lval, ndiag INTEGER idiag(*) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(lval,*), b(ldb,*), c(ldc,*) C: void mkl_sdiamm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *lval, int *idiag, int *ndiag, float *b, int *ldb, float *beta, float *c, int *ldc); BLAS and Sparse BLAS Routines 2 287 void mkl_ddiamm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *lval, int *idiag, int *ndiag, double *b, int *ldb, double *beta, double *c, int *ldc); void mkl_cdiamm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc); void mkl_zdiamm(char *transa, int *m, int 
*n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc);

mkl_?skymm
Computes matrix-matrix product of a sparse matrix stored using the skyline storage scheme with one-based indexing.

Syntax

Fortran:
call mkl_sskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
call mkl_dskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
call mkl_cskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)
call mkl_zskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc)

C:
mkl_sskymm(&transa, &m, &n, &k, &alpha, matdescra, val, pntr, b, &ldb, &beta, c, &ldc);
mkl_dskymm(&transa, &m, &n, &k, &alpha, matdescra, val, pntr, b, &ldb, &beta, c, &ldc);
mkl_cskymm(&transa, &m, &n, &k, &alpha, matdescra, val, pntr, b, &ldb, &beta, c, &ldc);
mkl_zskymm(&transa, &m, &n, &k, &alpha, matdescra, val, pntr, b, &ldb, &beta, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description
The mkl_?skymm routine performs a matrix-matrix operation defined as
C := alpha*A*B + beta*C
or
C := alpha*A'*B + beta*C,
where:
alpha and beta are scalars,
B and C are dense matrices,
A is an m-by-k sparse matrix in the skyline storage format,
A' is the transpose of A.
NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters
Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

transa CHARACTER*1. Specifies the operation.
If transa = 'N' or 'n', then C := alpha*A*B + beta*C.
If transa = 'T' or 't' or 'C' or 'c', then C := alpha*A'*B + beta*C.

m INTEGER. Number of rows of the matrix A.

n INTEGER. Number of columns of the matrix C.
k INTEGER. Number of columns of the matrix A.

alpha REAL for mkl_sskymm. DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm.
Specifies the scalar alpha.

matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
NOTE General matrices (matdescra(1)='G') are not supported.

val REAL for mkl_sskymm. DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm.
Array containing the set of elements of the matrix A in the skyline profile form. If matdescra(2)='L', then val contains elements from the lower triangle of the matrix A. If matdescra(2)='U', then val contains elements from the upper triangle of the matrix A. Refer to values array description in Skyline Storage Scheme for more details.

pntr INTEGER. Array of length (m+m) for lower triangle, and (k+k) for upper triangle. It contains the indices that specify the positions in val of the first element in each row (column) of the matrix A. Refer to pointers array description in Skyline Storage Scheme for more details.

b REAL for mkl_sskymm. DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm.
Array, DIMENSION (ldb, n). On entry with transa = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading m-by-n part of the array b must contain the matrix B.

ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.

beta REAL for mkl_sskymm. DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm.
Specifies the scalar beta.

c REAL for mkl_sskymm.
DOUBLE PRECISION for mkl_dskymm. COMPLEX for mkl_cskymm. DOUBLE COMPLEX for mkl_zskymm. Array, DIMENSION (ldc, n). On entry, the leading m-by-n part of the array c must contain the matrix C, otherwise the leading k-by-n part of the array c must contain the matrix C. ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. Output Parameters c Overwritten by the matrix (alpha*A*B + beta*C) or (alpha*A'*B + beta*C). Interfaces FORTRAN 77: SUBROUTINE mkl_sskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER pntr(*) REAL alpha, beta REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER pntr(*) DOUBLE PRECISION alpha, beta DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) 2 Intel® Math Kernel Library Reference Manual 290 SUBROUTINE mkl_cskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER pntr(*) COMPLEX alpha, beta COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zskymm(transa, m, n, k, alpha, matdescra, val, pntr, b, ldb, beta, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, k, ldb, ldc INTEGER pntr(*) DOUBLE COMPLEX alpha, beta DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_sskymm(char *transa, int *m, int *n, int *k, float *alpha, char *matdescra, float *val, int *pntr, float *b, int *ldb, float *beta, float *c, int *ldc); void mkl_dskymm(char *transa, int *m, int *n, int *k, double *alpha, char *matdescra, double *val, int *pntr, double *b, int *ldb, double *beta, double *c, int *ldc); void mkl_cskymm(char *transa, int *m, int *n, int *k, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *pntr, MKL_Complex8 *b, int *ldb, MKL_Complex8 *beta, MKL_Complex8 *c, int *ldc); void 
mkl_zskymm(char *transa, int *m, int *n, int *k, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *pntr, MKL_Complex16 *b, int *ldb, MKL_Complex16 *beta, MKL_Complex16 *c, int *ldc); mkl_?diasm Solves a system of linear matrix equations for a sparse matrix in the diagonal format with one-based indexing. Syntax Fortran: call mkl_sdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc) BLAS and Sparse BLAS Routines 2 291 call mkl_ddiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc) call mkl_cdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc) call mkl_zdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc) C: mkl_sdiasm(&transa, &m, &n, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, c, &ldc); mkl_ddiasm(&transa, &m, &n, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, c, &ldc); mkl_cdiasm(&transa, &m, &n, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, c, &ldc); mkl_zdiasm(&transa, &m, &n, &alpha, matdescra, val, &lval, idiag, &ndiag, b, &ldb, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?diasm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the diagonal format: C := alpha*inv(A)*B or C := alpha*inv(A')*B, where: alpha is scalar, B and C are dense matrices, A is a sparse upper or lower triangular matrix with unit or nonunit main diagonal, A' is the transpose of A. NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. 
If transa = 'N' or 'n', then C := alpha*inv(A)*B.
If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B.

m INTEGER. Number of rows of the matrix A.

n INTEGER. Number of columns of the matrix C.

alpha REAL for mkl_sdiasm. DOUBLE PRECISION for mkl_ddiasm. COMPLEX for mkl_cdiasm. DOUBLE COMPLEX for mkl_zdiasm.
Specifies the scalar alpha.

matdescra CHARACTER. Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.

val REAL for mkl_sdiasm. DOUBLE PRECISION for mkl_ddiasm. COMPLEX for mkl_cdiasm. DOUBLE COMPLEX for mkl_zdiasm.
Two-dimensional array of size lval by ndiag, contains non-zero diagonals of the matrix A. Refer to values array description in Diagonal Storage Scheme for more details.

lval INTEGER. Leading dimension of val, lval=m. Refer to lval description in Diagonal Storage Scheme for more details.

idiag INTEGER. Array of length ndiag, contains the distances between the main diagonal and each non-zero diagonal in the matrix A.
NOTE All elements of this array must be sorted in increasing order.
Refer to distance array description in Diagonal Storage Scheme for more details.

ndiag INTEGER. Specifies the number of non-zero diagonals of the matrix A.

b REAL for mkl_sdiasm. DOUBLE PRECISION for mkl_ddiasm. COMPLEX for mkl_cdiasm. DOUBLE COMPLEX for mkl_zdiasm.
Array, DIMENSION (ldb, n). On entry the leading m-by-n part of the array b must contain the matrix B.

ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.

ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program.

Output Parameters

c REAL for mkl_sdiasm.
DOUBLE PRECISION for mkl_ddiasm. COMPLEX for mkl_cdiasm. DOUBLE COMPLEX for mkl_zdiasm. Array, DIMENSION (ldc, n). The leading m-by-n part of the array c contains the matrix C. BLAS and Sparse BLAS Routines 2 293 Interfaces FORTRAN 77: SUBROUTINE mkl_sdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc, lval, ndiag INTEGER idiag(*) REAL alpha REAL val(lval,*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_ddiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc, lval, ndiag INTEGER idiag(*) DOUBLE PRECISION alpha DOUBLE PRECISION val(lval,*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_cdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc, lval, ndiag INTEGER idiag(*) COMPLEX alpha COMPLEX val(lval,*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zdiasm(transa, m, n, alpha, matdescra, val, lval, idiag, ndiag, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc, lval, ndiag INTEGER idiag(*) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(lval,*), b(ldb,*), c(ldc,*) C: void mkl_sdiasm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *lval, int *idiag, int *ndiag, float *b, int *ldb, float *c, int *ldc); 2 Intel® Math Kernel Library Reference Manual 294 void mkl_ddiasm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *lval, int *idiag, int *ndiag, double *b, int *ldb, double *c, int *ldc); void mkl_cdiasm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *lval, int *idiag, int *ndiag, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc); void mkl_zdiasm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *lval, int *idiag, int *ndiag, MKL_Complex16 *b, int *ldb, MKL_Complex16 
*c, int *ldc); mkl_?skysm Solves a system of linear matrix equations for a sparse matrix stored using the skyline storage scheme with one-based indexing. Syntax Fortran: call mkl_sskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) call mkl_dskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) call mkl_cskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) call mkl_zskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) C: mkl_sskysm(&transa, &m, &n, &alpha, matdescra, val, pntr, b, &ldb, c, &ldc); mkl_dskysm(&transa, &m, &n, &alpha, matdescra, val, pntr, b, &ldb, c, &ldc); mkl_cskysm(&transa, &m, &n, &alpha, matdescra, val, pntr, b, &ldb, c, &ldc); mkl_zskysm(&transa, &m, &n, &alpha, matdescra, val, pntr, b, &ldb, c, &ldc); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description The mkl_?skysm routine solves a system of linear equations with matrix-matrix operations for a sparse matrix in the skyline storage format: C := alpha*inv(A)*B or C := alpha*inv(A')*B, where: alpha is scalar, B and C are dense matrices, A is a sparse upper or lower triangular matrix with unit or nonunit main diagonal, A' is the transpose of A. BLAS and Sparse BLAS Routines 2 295 NOTE This routine supports only one-based indexing of the input arrays. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. transa CHARACTER*1. Specifies the system of linear equations. If transa = 'N' or 'n', then C := alpha*inv(A)*B, If transa = 'T' or 't' or 'C' or 'c', then C := alpha*inv(A')*B, m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix C. alpha REAL for mkl_sskysm. DOUBLE PRECISION for mkl_dskysm. COMPLEX for mkl_cskysm. DOUBLE COMPLEX for mkl_zskysm. Specifies the scalar alpha. matdescra CHARACTER. 
Array of six elements, specifies properties of the matrix used for operation. Only the first four array elements are used; their possible values are given in Table “Possible Values of the Parameter matdescra (descra)”. Possible combinations of element values of this parameter are given in Table “Possible Combinations of Element Values of the Parameter matdescra”.
NOTE General matrices (matdescra(1)='G') are not supported.

val REAL for mkl_sskysm. DOUBLE PRECISION for mkl_dskysm. COMPLEX for mkl_cskysm. DOUBLE COMPLEX for mkl_zskysm.
Array containing the set of elements of the matrix A in the skyline profile form. If matdescra(2)='L', then val contains elements from the lower triangle of the matrix A. If matdescra(2)='U', then val contains elements from the upper triangle of the matrix A. Refer to values array description in Skyline Storage Scheme for more details.

pntr INTEGER. Array of length (m+m). It contains the indices that specify the positions in val of the first non-zero element of each row (column) i of the matrix A, so that pointers(i) - pointers(1) + 1 is the index of that element in val. Refer to pointers array description in Skyline Storage Scheme for more details.

b REAL for mkl_sskysm. DOUBLE PRECISION for mkl_dskysm. COMPLEX for mkl_cskysm. DOUBLE COMPLEX for mkl_zskysm.
Array, DIMENSION (ldb, n). On entry the leading m-by-n part of the array b must contain the matrix B.

ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program.

ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program.

Output Parameters

c REAL for mkl_sskysm. DOUBLE PRECISION for mkl_dskysm. COMPLEX for mkl_cskysm. DOUBLE COMPLEX for mkl_zskysm.
Array, DIMENSION (ldc, n). The leading m-by-n part of the array c contains the matrix C.
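The relation pointers(i) - pointers(1) + 1 from the pntr description can be made concrete with two small helpers. This is a hedged sketch, not part of Intel MKL: the names sky_row_offset and sky_row_first_col are hypothetical, the lower triangle case is assumed, and the code further assumes that each row is stored through its diagonal element and that pntr carries one closing entry after the last row.

```c
#include <assert.h>

/* Hypothetical helpers (not MKL code) for a lower triangular skyline
 * matrix. pntr is a C array holding the one-based values described in
 * the text; row numbers i are one-based as in the Fortran interface. */

/* Zero-based offset into val of the first stored element of row i,
 * i.e. (pointers(i) - pointers(1) + 1) - 1. */
int sky_row_offset(const int *pntr, int i)
{
    assert(i >= 1);
    return pntr[i - 1] - pntr[0];
}

/* One-based column of the first stored element of row i, assuming the
 * row profile ends at the diagonal and pntr[i] marks the next row. */
int sky_row_first_col(const int *pntr, int i)
{
    assert(i >= 1);
    int len = pntr[i] - pntr[i - 1];   /* number of stored elements */
    return i - len + 1;
}
```

With pntr = {1, 2, 4, 6}, row 3 starts at zero-based offset 3 in val and its first stored element lies in column 2.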
Interfaces FORTRAN 77: SUBROUTINE mkl_sskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER pntr(*) REAL alpha REAL val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_dskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER pntr(*) DOUBLE PRECISION alpha DOUBLE PRECISION val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_cskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER pntr(*) COMPLEX alpha COMPLEX val(*), b(ldb,*), c(ldc,*) SUBROUTINE mkl_zskysm(transa, m, n, alpha, matdescra, val, pntr, b, ldb, c, ldc) CHARACTER*1 transa CHARACTER matdescra(*) INTEGER m, n, ldb, ldc INTEGER pntr(*) DOUBLE COMPLEX alpha DOUBLE COMPLEX val(*), b(ldb,*), c(ldc,*) C: void mkl_sskysm(char *transa, int *m, int *n, float *alpha, char *matdescra, float *val, int *pntr, float *b, int *ldb, float *c, int *ldc); void mkl_dskysm(char *transa, int *m, int *n, double *alpha, char *matdescra, double *val, int *pntr, double *b, int *ldb, double *c, int *ldc); void mkl_cskysm(char *transa, int *m, int *n, MKL_Complex8 *alpha, char *matdescra, MKL_Complex8 *val, int *pntr, MKL_Complex8 *b, int *ldb, MKL_Complex8 *c, int *ldc); void mkl_zskysm(char *transa, int *m, int *n, MKL_Complex16 *alpha, char *matdescra, MKL_Complex16 *val, int *pntr, MKL_Complex16 *b, int *ldb, MKL_Complex16 *c, int *ldc); mkl_?dnscsr Converts a sparse matrix in dense representation to the CSR format and vice versa. 
Syntax Fortran: call mkl_sdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) call mkl_ddnscsr(job, m, n, adns, lda, acsr, ja, ia, info) call mkl_cdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) call mkl_zdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) C: mkl_sdnscsr(job, &m, &n, adns, &lda, acsr, ja, ia, &info); mkl_ddnscsr(job, &m, &n, adns, &lda, acsr, ja, ia, &info); mkl_cdnscsr(job, &m, &n, adns, &lda, acsr, ja, ia, &info); mkl_zdnscsr(job, &m, &n, adns, &lda, acsr, ja, ia, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix stored as a rectangular m-by-n matrix A (dense representation) to the compressed sparse row (CSR) format (3-array variation) and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER. Array, contains the following conversion parameters: job(1) If job(1)=0, the rectangular matrix A is converted to the CSR format; if job(1)=1, the rectangular matrix A is restored from the CSR format. job(2) If job(2)=0, zero-based indexing for the rectangular matrix A is used; if job(2)=1, one-based indexing for the rectangular matrix A is used. job(3) If job(3)=0, zero-based indexing for the matrix in CSR format is used; if job(3)=1, one-based indexing for the matrix in CSR format is used. job(4) If job(4)=0, adns is a lower triangular part of matrix A; if job(4)=1, adns is an upper triangular part of matrix A; if job(4)=2, adns is a whole matrix A. job(5) job(5)=nzmax - maximum number of the non-zero elements allowed if job(1)=0. job(6) - job indicator for conversion to CSR format. If job(6)=0, only array ia is generated for the output storage. If job(6)>0, arrays acsr, ia, ja are generated for the output storage. 
m INTEGER. Number of rows of the matrix A. n INTEGER. Number of columns of the matrix A. adns (input/output) REAL for mkl_sdnscsr. DOUBLE PRECISION for mkl_ddnscsr. COMPLEX for mkl_cdnscsr. DOUBLE COMPLEX for mkl_zdnscsr. Array containing non-zero elements of the matrix A. lda (input/output) INTEGER. Specifies the leading dimension of adns as declared in the calling (sub)program, must be at least max(1, m). acsr (input/output) REAL for mkl_sdnscsr. DOUBLE PRECISION for mkl_ddnscsr. COMPLEX for mkl_cdnscsr. DOUBLE COMPLEX for mkl_zdnscsr. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. Output Parameters info INTEGER. Integer info indicator only for restoring the matrix A from the CSR format. If info=0, the execution is successful. If info=i, the routine is interrupted processing the i-th row because there is no space in the arrays adns and ja according to the value nzmax. 
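The relationship between the dense array and the three CSR arrays described above can be made concrete with a small stand-alone sketch. It is not the MKL routine and does not call MKL; the function name dense_to_csr and its return convention are hypothetical. It assumes the whole matrix is converted, one-based indexing for ja and ia, column-major (Fortran-style) storage of adns with leading dimension lda, and a return value that mimics the described info behaviour (0 on success, otherwise the one-based row at which nzmax was exhausted).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical helper, not an MKL routine).
   Converts a column-major m-by-n dense matrix to one-based 3-array CSR.
   Returns 0 on success, or the one-based number of the row being
   processed when more than nzmax non-zero elements were found. */
int dense_to_csr(int m, int n, const double *adns, int lda, int nzmax,
                 double *acsr, int *ja, int *ia)
{
    assert(m >= 0 && n >= 0);
    assert(lda >= (m > 1 ? m : 1));
    int nnz = 0;
    for (int i = 0; i < m; ++i) {
        ia[i] = nnz + 1;                        /* row i starts here (one-based) */
        for (int j = 0; j < n; ++j) {
            double v = adns[i + (size_t)j * (size_t)lda];
            if (v != 0.0) {
                if (nnz == nzmax)
                    return i + 1;               /* no space left in acsr and ja */
                acsr[nnz] = v;
                ja[nnz] = j + 1;                /* one-based column index */
                ++nnz;
            }
        }
    }
    ia[m] = nnz + 1;                            /* number of non-zeros plus one */
    return 0;
}
```

Calling the helper once with acsr and ja sized for the worst case, or first counting the non-zero elements, are both reasonable ways to choose nzmax in user code.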
Interfaces FORTRAN 77: SUBROUTINE mkl_sdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) INTEGER job(8) INTEGER m, n, lda, info INTEGER ja(*), ia(m+1) REAL adns(*), acsr(*) SUBROUTINE mkl_ddnscsr(job, m, n, adns, lda, acsr, ja, ia, info) INTEGER job(8) INTEGER m, n, lda, info INTEGER ja(*), ia(m+1) DOUBLE PRECISION adns(*), acsr(*) SUBROUTINE mkl_cdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) INTEGER job(8) INTEGER m, n, lda, info INTEGER ja(*), ia(m+1) COMPLEX adns(*), acsr(*) SUBROUTINE mkl_zdnscsr(job, m, n, adns, lda, acsr, ja, ia, info) INTEGER job(8) INTEGER m, n, lda, info INTEGER ja(*), ia(m+1) DOUBLE COMPLEX adns(*), acsr(*) C: void mkl_sdnscsr(int *job, int *m, int *n, float *adns, int *lda, float *acsr, int *ja, int *ia, int *info); void mkl_ddnscsr(int *job, int *m, int *n, double *adns, int *lda, double *acsr, int *ja, int *ia, int *info); void mkl_cdnscsr(int *job, int *m, int *n, MKL_Complex8 *adns, int *lda, MKL_Complex8 *acsr, int *ja, int *ia, int *info); void mkl_zdnscsr(int *job, int *m, int *n, MKL_Complex16 *adns, int *lda, MKL_Complex16 *acsr, int *ja, int *ia, int *info); mkl_?csrcoo Converts a sparse matrix in the CSR format to the coordinate format and vice versa. 
Syntax Fortran: call mkl_scsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) call mkl_dcsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) call mkl_ccsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) call mkl_zcsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) C: mkl_scsrcoo(job, &n, acsr, ja, ia, &nnz, acoo, rowind, colind, &info); mkl_dcsrcoo(job, &n, acsr, ja, ia, &nnz, acoo, rowind, colind, &info); mkl_ccsrcoo(job, &n, acsr, ja, ia, &nnz, acoo, rowind, colind, &info); mkl_zcsrcoo(job, &n, acsr, ja, ia, &nnz, acoo, rowind, colind, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to coordinate format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER. Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the coordinate format; if job(1)=1, the matrix in the coordinate format is converted to the CSR format; if job(1)=2, the matrix in the coordinate format is converted to the CSR format, and the column indices in CSR representation are sorted in increasing order within each row. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) If job(3)=0, zero-based indexing for the matrix in coordinate format is used; if job(3)=1, one-based indexing for the matrix in coordinate format is used. job(5) job(5)=nzmax - maximum number of the non-zero elements allowed if job(1)=0. job(5)=nnz - sets the number of the non-zero elements of the matrix A if job(1)=1. 
job(6) - job indicator. For conversion to the coordinate format: If job(6)=1, only array rowind is filled in for the output storage. If job(6)=2, arrays rowind, colind are filled in for the output storage. If job(6)=3, all arrays rowind, colind, acoo are filled in for the output storage. For conversion to the CSR format: If job(6)=0, all arrays acsr, ja, ia are filled in for the output storage. If job(6)=1, only array ia is filled in for the output storage. If job(6)=2, then it is assumed that the routine already has been called with the job(6)=1, and the user allocated the required space for storing the output arrays acsr and ja. n INTEGER. Dimension of the matrix A. acsr (input/output) REAL for mkl_scsrcoo. DOUBLE PRECISION for mkl_dcsrcoo. COMPLEX for mkl_ccsrcoo. DOUBLE COMPLEX for mkl_zcsrcoo. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each nonzero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length n + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(n + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. acoo (input/output) REAL for mkl_scsrcoo. DOUBLE PRECISION for mkl_dcsrcoo. COMPLEX for mkl_ccsrcoo. DOUBLE COMPLEX for mkl_zcsrcoo. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. 
rowind (input/output) INTEGER. Array of length nnz, contains the row indices for each non-zero element of the matrix A. Refer to rows array description in Coordinate Format for more details. colind (input/output) INTEGER. Array of length nnz, contains the column indices for each non-zero element of the matrix A. Refer to columns array description in Coordinate Format for more details. Output Parameters nnz INTEGER. Specifies the number of non-zero elements of the matrix A. Refer to nnz description in Coordinate Format for more details. info INTEGER. Integer info indicator only for converting the matrix A from the CSR format. If info=0, the execution is successful. If info=1, the routine is interrupted because there is no space in the arrays acoo, rowind, colind according to the value nzmax. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) INTEGER job(8) INTEGER n, nnz, info INTEGER ja(*), ia(n+1), rowind(*), colind(*) REAL acsr(*), acoo(*) SUBROUTINE mkl_dcsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) INTEGER job(8) INTEGER n, nnz, info INTEGER ja(*), ia(n+1), rowind(*), colind(*) DOUBLE PRECISION acsr(*), acoo(*) SUBROUTINE mkl_ccsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) INTEGER job(8) INTEGER n, nnz, info INTEGER ja(*), ia(n+1), rowind(*), colind(*) COMPLEX acsr(*), acoo(*) SUBROUTINE mkl_zcsrcoo(job, n, acsr, ja, ia, nnz, acoo, rowind, colind, info) INTEGER job(8) INTEGER n, nnz, info INTEGER ja(*), ia(n+1), rowind(*), colind(*) DOUBLE COMPLEX acsr(*), acoo(*) C: void mkl_scsrcoo(int *job, int *n, float *acsr, int *ja, int *ia, int *nnz, float *acoo, int *rowind, int *colind, int *info); void mkl_dcsrcoo(int *job, int *n, double *acsr, int *ja, int *ia, int *nnz, double *acoo, int *rowind, int *colind, int *info); void mkl_ccsrcoo(int *job, int *n, MKL_Complex8 *acsr, int *ja, int *ia, int *nnz, 
MKL_Complex8 *acoo, int *rowind, int *colind, int *info); void mkl_zcsrcoo(int *job, int *n, MKL_Complex16 *acsr, int *ja, int *ia, int *nnz, MKL_Complex16 *acoo, int *rowind, int *colind, int *info); mkl_?csrbsr Converts a sparse matrix in the CSR format to the BSR format and vice versa. Syntax Fortran: call mkl_scsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) call mkl_dcsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) call mkl_ccsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) call mkl_zcsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) C: mkl_scsrbsr(job, &m, &mblk, &ldabsr, acsr, ja, ia, absr, jab, iab, &info); mkl_dcsrbsr(job, &m, &mblk, &ldabsr, acsr, ja, ia, absr, jab, iab, &info); mkl_ccsrbsr(job, &m, &mblk, &ldabsr, acsr, ja, ia, absr, jab, iab, &info); mkl_zcsrbsr(job, &m, &mblk, &ldabsr, acsr, ja, ia, absr, jab, iab, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to the block sparse row (BSR) format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the BSR format; if job(1)=1, the matrix in the BSR format is converted to the CSR format. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) If job(3)=0, zero-based indexing for the matrix in the BSR format is used; if job(3)=1, one-based indexing for the matrix in the BSR format is used. 
job(4) is only used for conversion to CSR format. By default, the converter saves the blocks without checking whether an element is zero or not. If job(4)=1, then the converter only saves non-zero elements in blocks. job(6) - job indicator. For conversion to the BSR format: If job(6)=0, only arrays jab, iab are generated for the output storage. If job(6)>0, all output arrays absr, jab, and iab are filled in for the output storage. If job(6)=-1, iab(1) returns the number of non-zero blocks. For conversion to the CSR format: If job(6)=0, only arrays ja, ia are generated for the output storage. m INTEGER. Actual row dimension of the matrix A for conversion to the BSR format; block row dimension of the matrix A for conversion to the CSR format. mblk INTEGER. Size of the block in the matrix A. ldabsr INTEGER. Leading dimension of the array absr as declared in the calling program. ldabsr must be greater than or equal to mblk*mblk. acsr (input/output) REAL for mkl_scsrbsr. DOUBLE PRECISION for mkl_dcsrbsr. COMPLEX for mkl_ccsrbsr. DOUBLE COMPLEX for mkl_zcsrbsr. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. absr (input/output) REAL for mkl_scsrbsr. 
DOUBLE PRECISION for mkl_dcsrbsr. COMPLEX for mkl_ccsrbsr. DOUBLE COMPLEX for mkl_zcsrbsr. Array containing elements of non-zero blocks of the matrix A. Its length is equal to the number of non-zero blocks in the matrix A multiplied by mblk*mblk. Refer to values array description in BSR Format for more details. jab (input/output) INTEGER. Array containing the column indices for each non-zero block of the matrix A. Its length is equal to the number of non-zero blocks of the matrix A. Refer to columns array description in BSR Format for more details. iab (input/output) INTEGER. Array of length (m + 1), containing indices of blocks in the array absr, such that iab(i) is the index in the array absr of the first non-zero element from the i-th row. The value of the last element iab(m + 1) is equal to the number of non-zero blocks plus one. Refer to rowIndex array description in BSR Format for more details. Output Parameters info INTEGER. Integer info indicator only for converting the matrix A from the CSR format. If info=0, the execution is successful. If info=1, it means that mblk is equal to 0. If info=2, it means that ldabsr is less than mblk*mblk and there is no space for all blocks. 
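The role of the block arrays jab and iab can be illustrated by deriving only the block structure from a CSR matrix, which is roughly what the description associates with job(6)=0 for conversion to the BSR format. The sketch below is not the MKL routine and does not call MKL; the function name csr_block_structure is hypothetical. It assumes a square m-by-m matrix with m divisible by mblk, one-based indexing throughout, and that a block counts as non-zero when at least one CSR entry falls inside it.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch only (hypothetical helper, not an MKL routine).
   Builds the one-based block structure (iab, jab) of an m-by-m CSR
   matrix for square blocks of size mblk.
     iab : length m/mblk + 1, first block of each block row
     jab : block column index of each non-zero block
   Returns the number of non-zero blocks, or -1 if allocation fails. */
int csr_block_structure(int m, int mblk, const int *ia, const int *ja,
                        int *iab, int *jab)
{
    assert(mblk > 0 && m >= 0 && m % mblk == 0);
    int nb = m / mblk;                           /* number of block rows */
    int *seen = calloc((size_t)(nb > 0 ? nb : 1), sizeof *seen);
    if (seen == NULL)
        return -1;

    int nblocks = 0;
    for (int bi = 0; bi < nb; ++bi) {
        iab[bi] = nblocks + 1;
        /* Stamp every block column touched by a row of this block row. */
        for (int i = bi * mblk; i < (bi + 1) * mblk; ++i)
            for (int k = ia[i] - 1; k < ia[i + 1] - 1; ++k)
                seen[(ja[k] - 1) / mblk] = bi + 1;
        /* Emit the stamped block columns in increasing order. */
        for (int bj = 0; bj < nb; ++bj)
            if (seen[bj] == bi + 1)
                jab[nblocks++] = bj + 1;
    }
    iab[nb] = nblocks + 1;                       /* number of blocks plus one */
    free(seen);
    return nblocks;
}
```

The returned block count multiplied by mblk*mblk gives the storage that an array such as absr would need for the block values.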
Interfaces FORTRAN 77: SUBROUTINE mkl_scsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) INTEGER job(8) INTEGER m, mblk, ldabsr, info INTEGER ja(*), ia(m+1), jab(*), iab(*) REAL acsr(*), absr(ldabsr,*) SUBROUTINE mkl_dcsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) INTEGER job(8) INTEGER m, mblk, ldabsr, info INTEGER ja(*), ia(m+1), jab(*), iab(*) DOUBLE PRECISION acsr(*), absr(ldabsr,*) SUBROUTINE mkl_ccsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) INTEGER job(8) INTEGER m, mblk, ldabsr, info INTEGER ja(*), ia(m+1), jab(*), iab(*) COMPLEX acsr(*), absr(ldabsr,*) SUBROUTINE mkl_zcsrbsr(job, m, mblk, ldabsr, acsr, ja, ia, absr, jab, iab, info) INTEGER job(8) INTEGER m, mblk, ldabsr, info INTEGER ja(*), ia(m+1), jab(*), iab(*) DOUBLE COMPLEX acsr(*), absr(ldabsr,*) C: void mkl_scsrbsr(int *job, int *m, int *mblk, int *ldabsr, float *acsr, int *ja, int *ia, float *absr, int *jab, int *iab, int *info); void mkl_dcsrbsr(int *job, int *m, int *mblk, int *ldabsr, double *acsr, int *ja, int *ia, double *absr, int *jab, int *iab, int *info); void mkl_ccsrbsr(int *job, int *m, int *mblk, int *ldabsr, MKL_Complex8 *acsr, int *ja, int *ia, MKL_Complex8 *absr, int *jab, int *iab, int *info); void mkl_zcsrbsr(int *job, int *m, int *mblk, int *ldabsr, MKL_Complex16 *acsr, int *ja, int *ia, MKL_Complex16 *absr, int *jab, int *iab, int *info); mkl_?csrcsc Converts a square sparse matrix in the CSR format to the CSC format and vice versa. 
Syntax Fortran: call mkl_scsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) call mkl_dcsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) call mkl_ccsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) call mkl_zcsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) C: mkl_scsrcsc(job, &m, acsr, ja, ia, acsc, ja1, ia1, &info); mkl_dcsrcsc(job, &m, acsr, ja, ia, acsc, ja1, ia1, &info); mkl_ccsrcsc(job, &m, acsr, ja, ia, acsc, ja1, ia1, &info); mkl_zcsrcsc(job, &m, acsr, ja, ia, acsc, ja1, ia1, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a square sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to the compressed sparse column (CSC) format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER. Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the CSC format; if job(1)=1, the matrix in the CSC format is converted to the CSR format. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) If job(3)=0, zero-based indexing for the matrix in the CSC format is used; if job(3)=1, one-based indexing for the matrix in the CSC format is used. job(6) - job indicator. For conversion to the CSC format: If job(6)=0, only arrays ja1, ia1 are filled in for the output storage. If job(6)≠0, all output arrays acsc, ja1, and ia1 are filled in for the output storage. For conversion to the CSR format: If job(6)=0, only arrays ja, ia are filled in for the output storage. If job(6)≠0, all output arrays acsr, ja, and ia are filled in for the output storage. m INTEGER. 
Dimension of the square matrix A. acsr (input/output) REAL for mkl_scsrcsc. DOUBLE PRECISION for mkl_dcsrcsc. COMPLEX for mkl_ccsrcsc. DOUBLE COMPLEX for mkl_zcsrcsc. Array containing non-zero elements of the square matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each nonzero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. acsc (input/output) REAL for mkl_scsrcsc. DOUBLE PRECISION for mkl_dcsrcsc. COMPLEX for mkl_ccsrcsc. DOUBLE COMPLEX for mkl_zcsrcsc. Array containing non-zero elements of the square matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja1 (input/output) INTEGER. Array containing the row indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsc. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia1 (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsc, such that ia1(I) is the index in the array acsc of the first non-zero element from the column I. The value of the last element ia1(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. Output Parameters info INTEGER. 
This parameter is not used now. Interfaces FORTRAN 77: SUBROUTINE mkl_scsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), ja1(*), ia1(m+1) REAL acsr(*), acsc(*) SUBROUTINE mkl_dcsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), ja1(*), ia1(m+1) DOUBLE PRECISION acsr(*), acsc(*) SUBROUTINE mkl_ccsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), ja1(*), ia1(m+1) COMPLEX acsr(*), acsc(*) SUBROUTINE mkl_zcsrcsc(job, m, acsr, ja, ia, acsc, ja1, ia1, info) INTEGER job(8) INTEGER m, info INTEGER ja(*), ia(m+1), ja1(*), ia1(m+1) DOUBLE COMPLEX acsr(*), acsc(*) C: void mkl_scsrcsc(int *job, int *m, float *acsr, int *ja, int *ia, float *acsc, int *ja1, int *ia1, int *info); void mkl_dcsrcsc(int *job, int *m, double *acsr, int *ja, int *ia, double *acsc, int *ja1, int *ia1, int *info); void mkl_ccsrcsc(int *job, int *m, MKL_Complex8 *acsr, int *ja, int *ia, MKL_Complex8 *acsc, int *ja1, int *ia1, int *info); void mkl_zcsrcsc(int *job, int *m, MKL_Complex16 *acsr, int *ja, int *ia, MKL_Complex16 *acsc, int *ja1, int *ia1, int *info); mkl_?csrdia Converts a sparse matrix in the CSR format to the diagonal format and vice versa. 
Syntax Fortran: call mkl_scsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) call mkl_dcsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) call mkl_ccsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) call mkl_zcsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) C: mkl_scsrdia(job, &m, acsr, ja, ia, adia, &ndiag, distance, &idiag, acsr_rem, ja_rem, ia_rem, &info); mkl_dcsrdia(job, &m, acsr, ja, ia, adia, &ndiag, distance, &idiag, acsr_rem, ja_rem, ia_rem, &info); mkl_ccsrdia(job, &m, acsr, ja, ia, adia, &ndiag, distance, &idiag, acsr_rem, ja_rem, ia_rem, &info); mkl_zcsrdia(job, &m, acsr, ja, ia, adia, &ndiag, distance, &idiag, acsr_rem, ja_rem, ia_rem, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to the diagonal format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER. Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the diagonal format; if job(1)=1, the matrix in the diagonal format is converted to the CSR format. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) If job(3)=0, zero-based indexing for the matrix in the diagonal format is used; if job(3)=1, one-based indexing for the matrix in the diagonal format is used. job(6) - job indicator. 
For conversion to the diagonal format: If job(6)=0, diagonals are not selected internally, and acsr_rem, ja_rem, ia_rem are not filled in for the output storage. If job(6)=1, diagonals are not selected internally, and acsr_rem, ja_rem, ia_rem are filled in for the output storage. If job(6)=10, diagonals are selected internally, and acsr_rem, ja_rem, ia_rem are not filled in for the output storage. If job(6)=11, diagonals are selected internally, and acsr_rem, ja_rem, ia_rem are filled in for the output storage. For conversion to the CSR format: If job(6)=0, each entry in the array adia is checked whether it is zero. Zero entries are not included in the array acsr. If job(6)≠0, the entries in the array adia are not checked for zero. m INTEGER. Dimension of the matrix A. acsr (input/output) REAL for mkl_scsrdia. DOUBLE PRECISION for mkl_dcsrdia. COMPLEX for mkl_ccsrdia. DOUBLE COMPLEX for mkl_zcsrdia. Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details. ja (input/output) INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details. ia (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details. adia (input/output) REAL for mkl_scsrdia. DOUBLE PRECISION for mkl_dcsrdia. COMPLEX for mkl_ccsrdia. DOUBLE COMPLEX for mkl_zcsrdia. Array of size (ndiag x idiag) containing diagonals of the matrix A. 
The key point of the storage is that each element in the array adia retains the row number of the original matrix. To achieve this, diagonals in the lower triangular part of the matrix are padded from the top, and those in the upper triangular part are padded from the bottom. ndiag INTEGER. Specifies the leading dimension of the array adia as declared in the calling (sub)program, must be at least max(1, m). distance INTEGER. Array of length idiag, containing the distances between the main diagonal and each non-zero diagonal to be extracted. The distance is positive if the diagonal is above the main diagonal, and negative if the diagonal is below the main diagonal. The main diagonal has a distance equal to zero. idiag INTEGER. Number of diagonals to be extracted. For conversion to the diagonal format, this parameter may be modified on return. acsr_rem, ja_rem, ia_rem Remainder of the matrix in the CSR format if it is needed for conversion to the diagonal format. Output Parameters info INTEGER. This parameter is not used now. 
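The padding rule for adia follows directly from storing element A(i, i + distance(d)) in row i of column d. The sketch below illustrates this for diagonals chosen by the caller. It is not the MKL routine and does not call MKL; the function name csr_to_dia and its return value are hypothetical. It assumes one-based ja and ia, column-major storage of adia with leading dimension ndiag, and that entries lying on none of the requested diagonals are only counted (they correspond to what the description calls the remainder).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical helper, not an MKL routine).
   Copies the diagonals listed in distance[] from a one-based CSR matrix
   into adia (ndiag-by-idiag, column-major). A(i, i + distance[d]) is
   written to row i of column d, so each entry keeps its row number.
   Returns how many CSR entries lie on none of the requested diagonals. */
int csr_to_dia(int m, const double *acsr, const int *ja, const int *ia,
               double *adia, int ndiag, const int *distance, int idiag)
{
    assert(m >= 0 && idiag >= 0);
    assert(ndiag >= (m > 1 ? m : 1));

    /* Padding positions and absent entries are stored as zeros. */
    for (size_t k = 0; k < (size_t)ndiag * (size_t)idiag; ++k)
        adia[k] = 0.0;

    int remainder = 0;
    for (int i = 0; i < m; ++i) {
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; ++k) {
            int dist = (ja[k] - 1) - i;         /* column minus row */
            int d = 0;
            while (d < idiag && distance[d] != dist)
                ++d;                            /* linear search over diagonals */
            if (d < idiag)
                adia[i + (size_t)d * (size_t)ndiag] = acsr[k];
            else
                ++remainder;
        }
    }
    return remainder;
}
```

For a 3-by-3 tridiagonal matrix with distance = {-1, 0, 1}, the first column of adia starts with a padding zero and the last column ends with one, matching the top and bottom padding described above.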
Interfaces FORTRAN 77: SUBROUTINE mkl_scsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) INTEGER job(8) INTEGER m, info, ndiag, idiag INTEGER ja(*), ia(m+1), distance(*), ja_rem(*), ia_rem(*) REAL acsr(*), adia(*), acsr_rem(*) SUBROUTINE mkl_dcsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) INTEGER job(8) INTEGER m, info, ndiag, idiag INTEGER ja(*), ia(m+1), distance(*), ja_rem(*), ia_rem(*) DOUBLE PRECISION acsr(*), adia(*), acsr_rem(*) SUBROUTINE mkl_ccsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) INTEGER job(8) INTEGER m, info, ndiag, idiag INTEGER ja(*), ia(m+1), distance(*), ja_rem(*), ia_rem(*) COMPLEX acsr(*), adia(*), acsr_rem(*) SUBROUTINE mkl_zcsrdia(job, m, acsr, ja, ia, adia, ndiag, distance, idiag, acsr_rem, ja_rem, ia_rem, info) INTEGER job(8) INTEGER m, info, ndiag, idiag INTEGER ja(*), ia(m+1), distance(*), ja_rem(*), ia_rem(*) DOUBLE COMPLEX acsr(*), adia(*), acsr_rem(*) C: void mkl_scsrdia(int *job, int *m, float *acsr, int *ja, int *ia, float *adia, int *ndiag, int *distance, int *idiag, float *acsr_rem, int *ja_rem, int *ia_rem, int *info); void mkl_dcsrdia(int *job, int *m, double *acsr, int *ja, int *ia, double *adia, int *ndiag, int *distance, int *idiag, double *acsr_rem, int *ja_rem, int *ia_rem, int *info); void mkl_ccsrdia(int *job, int *m, MKL_Complex8 *acsr, int *ja, int *ia, MKL_Complex8 *adia, int *ndiag, int *distance, int *idiag, MKL_Complex8 *acsr_rem, int *ja_rem, int *ia_rem, int *info); void mkl_zcsrdia(int *job, int *m, MKL_Complex16 *acsr, int *ja, int *ia, MKL_Complex16 *adia, int *ndiag, int *distance, int *idiag, MKL_Complex16 *acsr_rem, int *ja_rem, int *ia_rem, int *info); mkl_?csrsky Converts a sparse matrix in CSR format to the skyline format and vice versa. 
Syntax Fortran: call mkl_scsrsky(job, m, acsr, ja, ia, asky, pointers, info) call mkl_dcsrsky(job, m, acsr, ja, ia, asky, pointers, info) call mkl_ccsrsky(job, m, acsr, ja, ia, asky, pointers, info) call mkl_zcsrsky(job, m, acsr, ja, ia, asky, pointers, info) C: mkl_scsrsky(job, &m, acsr, ja, ia, asky, pointers, &info); mkl_dcsrsky(job, &m, acsr, ja, ia, asky, pointers, &info); mkl_ccsrsky(job, &m, acsr, ja, ia, asky, pointers, &info); mkl_zcsrsky(job, &m, acsr, ja, ia, asky, pointers, &info); Include Files • FORTRAN 77: mkl_spblas.fi • C: mkl_spblas.h Description This routine converts a sparse matrix A stored in the compressed sparse row (CSR) format (3-array variation) to the skyline format and vice versa. Input Parameters Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below. job INTEGER Array, contains the following conversion parameters: job(1) If job(1)=0, the matrix in the CSR format is converted to the skyline format; if job(1)=1, the matrix in the skyline format is converted to the CSR format. job(2) If job(2)=0, zero-based indexing for the matrix in CSR format is used; if job(2)=1, one-based indexing for the matrix in CSR format is used. job(3) If job(3)=0, zero-based indexing for the matrix in the skyline format is used; if job(3)=1, one-based indexing for the matrix in the skyline format is used. job(4) For conversion to the skyline format: If job(4)=0, the upper part of the matrix A in the CSR format is converted. If job(4)=1, the lower part of the matrix A in the CSR format is converted. For conversion to the CSR format: If job(4)=0, the matrix is converted to the upper part of the matrix A in the CSR format. If job(4)=1, the matrix is converted to the lower part of the matrix A in the CSR format. 
    job(5)
        job(5)=nzmax, the maximum number of non-zero elements of the matrix A if job(1)=0.
    job(6)
        Job indicator. Only for conversion to the skyline format:
        If job(6)=0, only the array pointers is filled in for the output storage.
        If job(6)=1, both output arrays asky and pointers are filled in for the output storage.
m
    INTEGER. Dimension of the matrix A.
acsr
    (input/output) REAL for mkl_scsrsky. DOUBLE PRECISION for mkl_dcsrsky. COMPLEX for mkl_ccsrsky. DOUBLE COMPLEX for mkl_zcsrsky.
    Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
ja
    (input/output) INTEGER. Array containing the column indices for each non-zero element of the matrix A. Its length is equal to the length of the array acsr. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ia
    (input/output) INTEGER. Array of length m + 1, containing indices of elements in the array acsr, such that ia(I) is the index in the array acsr of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zeros plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
asky
    (input/output) REAL for mkl_scsrsky. DOUBLE PRECISION for mkl_dcsrsky. COMPLEX for mkl_ccsrsky. DOUBLE COMPLEX for mkl_zcsrsky.
    Array. For the lower triangular part of A, it contains the set of elements from each row, starting from the first non-zero element up to and including the diagonal element. For the upper triangular part of A, it contains the set of elements from each column, starting with the first non-zero element down to and including the diagonal element. Zero elements encountered within these ranges are included in the sets. Refer to values array description in Skyline Storage Format for more details.
pointers
    (input/output) INTEGER.
    Array with dimension (m+1), where m is the number of rows for the lower triangle (columns for the upper triangle). pointers(I) - pointers(1) + 1 gives the index in the array asky of the first non-zero element in row (column) I. The value of pointers(m+1) is set to nnz + pointers(1), where nnz is the number of elements in the array asky. Refer to pointers array description in Skyline Storage Format for more details.

Output Parameters

info
    INTEGER. Integer info indicator only for converting the matrix A from the CSR format.
    If info=0, the execution is successful.
    If info=1, the routine is interrupted because there is no space in the array asky according to the value nzmax.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_scsrsky(job, m, acsr, ja, ia, asky, pointers, info)
    INTEGER job(8)
    INTEGER m, info
    INTEGER ja(*), ia(m+1), pointers(m+1)
    REAL acsr(*), asky(*)

SUBROUTINE mkl_dcsrsky(job, m, acsr, ja, ia, asky, pointers, info)
    INTEGER job(8)
    INTEGER m, info
    INTEGER ja(*), ia(m+1), pointers(m+1)
    DOUBLE PRECISION acsr(*), asky(*)

SUBROUTINE mkl_ccsrsky(job, m, acsr, ja, ia, asky, pointers, info)
    INTEGER job(8)
    INTEGER m, info
    INTEGER ja(*), ia(m+1), pointers(m+1)
    COMPLEX acsr(*), asky(*)

SUBROUTINE mkl_zcsrsky(job, m, acsr, ja, ia, asky, pointers, info)
    INTEGER job(8)
    INTEGER m, info
    INTEGER ja(*), ia(m+1), pointers(m+1)
    DOUBLE COMPLEX acsr(*), asky(*)

C:

void mkl_scsrsky(int *job, int *m, float *acsr, int *ja, int *ia, float *asky, int *pointers, int *info);
void mkl_dcsrsky(int *job, int *m, double *acsr, int *ja, int *ia, double *asky, int *pointers, int *info);
void mkl_ccsrsky(int *job, int *m, MKL_Complex8 *acsr, int *ja, int *ia, MKL_Complex8 *asky, int *pointers, int *info);
void mkl_zcsrsky(int *job, int *m, MKL_Complex16 *acsr, int *ja, int *ia, MKL_Complex16 *asky, int *pointers, int *info);

mkl_?csradd
Computes the sum of two matrices stored in the CSR format (3-array variation) with one-based indexing.
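As a rough picture of the sum this routine computes, the following plain-C sketch merges the rows of two one-based CSR matrices for the non-transposed case, C := A + beta*B. It is an illustration only, not the MKL implementation: the names csr_add_row and csr_add are hypothetical, only double precision is shown, and the column indices of every row are assumed to be sorted in increasing order.

```c
#include <assert.h>

/* Hypothetical helper, not an MKL routine: merges row `row` (one-based) of
   A and beta*B into (c, jc), starting at zero-based offset `pos`.
   Assumes both rows have column indices sorted in increasing order.
   Returns the number of entries written. */
static int csr_add_row(int row,
                       const double *a, const int *ja, const int *ia,
                       double beta,
                       const double *b, const int *jb, const int *ib,
                       double *c, int *jc, int pos)
{
    int p = ia[row - 1] - 1, pend = ia[row] - 1;   /* one-based -> zero-based */
    int q = ib[row - 1] - 1, qend = ib[row] - 1;
    int start = pos;

    while (p < pend || q < qend) {
        if (q >= qend || (p < pend && ja[p] < jb[q])) {
            jc[pos] = ja[p]; c[pos] = a[p]; p++;               /* only in A */
        } else if (p >= pend || jb[q] < ja[p]) {
            jc[pos] = jb[q]; c[pos] = beta * b[q]; q++;        /* only in B */
        } else {
            jc[pos] = ja[p]; c[pos] = a[p] + beta * b[q];      /* in both   */
            p++; q++;
        }
        pos++;
    }
    return pos - start;
}

/* Hypothetical reference for C := A + beta*B; A and B both have m rows.
   The caller must provide c and jc large enough for the result and ic of
   length m + 1. All index arrays hold one-based values. */
static void csr_add(int m,
                    const double *a, const int *ja, const int *ia,
                    double beta,
                    const double *b, const int *jb, const int *ib,
                    double *c, int *jc, int *ic)
{
    assert(m >= 0);
    int pos = 0;
    for (int i = 1; i <= m; i++) {
        ic[i - 1] = pos + 1;
        pos += csr_add_row(i, a, ja, ia, beta, b, jb, ib, c, jc, pos);
    }
    ic[m] = pos + 1;   /* number of stored elements of C plus one */
}
```

The merge only works because each row is sorted, which is why unsorted input has to be reordered before the addition.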
Syntax

Fortran:
call mkl_scsradd(trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info)
call mkl_dcsradd(trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info)
call mkl_ccsradd(trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info)
call mkl_zcsradd(trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info)

C:
mkl_scsradd(&trans, &request, &sort, &m, &n, a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info);
mkl_dcsradd(&trans, &request, &sort, &m, &n, a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info);
mkl_ccsradd(&trans, &request, &sort, &m, &n, a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info);
mkl_zcsradd(&trans, &request, &sort, &m, &n, a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?csradd routine performs a matrix-matrix operation defined as
C := A+beta*op(B)
where:
A, B, C are the sparse matrices in the CSR format (3-array variation).
op(B) is one of op(B) = B, or op(B) = B', or op(B) = conjg(B')
beta is a scalar.

The routine works correctly if and only if the column indices in sparse matrix representations of matrices A and B are arranged in the increasing order for each row. If not, use the parameter sort (see below) to reorder column indices and the corresponding elements of the input matrices.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

trans
    CHARACTER*1. Specifies the operation.
    If trans = 'N' or 'n', then C := A+beta*B
    If trans = 'T' or 't' or 'C' or 'c', then C := A+beta*B'.
request
    INTEGER.
    If request=0, the routine performs addition; the memory for the output arrays ic, jc, c must be allocated beforehand.
    If request=1, the routine computes only the values of the array ic of length m + 1; the memory for this array must be allocated beforehand. On exit, the value ic(m+1) - 1 is the actual number of elements in the arrays c and jc.
    If request=2, the routine has been called previously with the parameter request=1; the output arrays jc and c are allocated in the calling program and are of length ic(m+1) - 1 at least.
sort
    INTEGER. Specifies the type of reordering. If this parameter is not set (default), the routine does not perform reordering.
    If sort=1, the routine arranges the column indices ja for each row in the increasing order and reorders the corresponding values of the matrix A in the array a.
    If sort=2, the routine arranges the column indices jb for each row in the increasing order and reorders the corresponding values of the matrix B in the array b.
    If sort=3, the routine performs reordering for both input matrices A and B.
m
    INTEGER. Number of rows of the matrix A.
n
    INTEGER. Number of columns of the matrix A.
a
    REAL for mkl_scsradd. DOUBLE PRECISION for mkl_dcsradd. COMPLEX for mkl_ccsradd. DOUBLE COMPLEX for mkl_zcsradd.
    Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
ja
    INTEGER. Array containing the column indices for each non-zero element of the matrix A. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ia
    INTEGER. Array of length m + 1, containing indices of elements in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I.
    The value of the last element ia(m + 1) is equal to the number of non-zero elements of the matrix A plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
beta
    REAL for mkl_scsradd. DOUBLE PRECISION for mkl_dcsradd. COMPLEX for mkl_ccsradd. DOUBLE COMPLEX for mkl_zcsradd.
    Specifies the scalar beta.
b
    REAL for mkl_scsradd. DOUBLE PRECISION for mkl_dcsradd. COMPLEX for mkl_ccsradd. DOUBLE COMPLEX for mkl_zcsradd.
    Array containing non-zero elements of the matrix B. Its length is equal to the number of non-zero elements in the matrix B. Refer to values array description in Sparse Matrix Storage Formats for more details.
jb
    INTEGER. Array containing the column indices for each non-zero element of the matrix B. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array b. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ib
    INTEGER. Array of length m + 1 when trans = 'N' or 'n', or n + 1 otherwise. This array contains indices of elements in the array b, such that ib(I) is the index in the array b of the first non-zero element from the row I. The value of the last element ib(m + 1) or ib(n + 1) is equal to the number of non-zero elements of the matrix B plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
nzmax
    INTEGER. The length of the arrays c and jc. This parameter is used only if request=0. The routine stops calculation if the number of elements in the result matrix C exceeds the specified value of nzmax.

Output Parameters

c
    REAL for mkl_scsradd. DOUBLE PRECISION for mkl_dcsradd. COMPLEX for mkl_ccsradd. DOUBLE COMPLEX for mkl_zcsradd.
    Array containing non-zero elements of the result matrix C. Its length is equal to the number of non-zero elements in the matrix C. Refer to values array description in Sparse Matrix Storage Formats for more details.
jc
    INTEGER.
    Array containing the column indices for each non-zero element of the matrix C. The length of this array is equal to the length of the array c. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ic
    INTEGER. Array of length m + 1, containing indices of elements in the array c, such that ic(I) is the index in the array c of the first non-zero element from the row I. The value of the last element ic(m + 1) is equal to the number of non-zero elements of the matrix C plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
info
    INTEGER.
    If info=0, the execution is successful.
    If info=I>0, the routine stops calculation in the I-th row of the matrix C because the number of elements in C exceeds nzmax.
    If info=-1, the routine calculates only the size of the arrays c and jc and returns this value plus 1 as the last element of the array ic.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_scsradd( trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info)
    CHARACTER trans
    INTEGER request, sort, m, n, nzmax, info
    INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*)
    REAL a(*), b(*), c(*), beta

SUBROUTINE mkl_dcsradd( trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info)
    CHARACTER trans
    INTEGER request, sort, m, n, nzmax, info
    INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*)
    DOUBLE PRECISION a(*), b(*), c(*), beta

SUBROUTINE mkl_ccsradd( trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info)
    CHARACTER trans
    INTEGER request, sort, m, n, nzmax, info
    INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*)
    COMPLEX a(*), b(*), c(*), beta

SUBROUTINE mkl_zcsradd( trans, request, sort, m, n, a, ja, ia, beta, b, jb, ib, c, jc, ic, nzmax, info)
    CHARACTER trans
    INTEGER request, sort, m, n, nzmax, info
    INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*)
    DOUBLE COMPLEX a(*), b(*), c(*), beta

C:

void mkl_scsradd(char *trans, int *request, int *sort, int *m, int *n, float *a, int *ja, int *ia, float *beta, float *b, int *jb, int *ib, float *c, int *jc, int *ic, int *nzmax, int *info);
void mkl_dcsradd(char *trans, int *request, int *sort, int *m, int *n, double *a, int *ja, int *ia, double *beta, double *b, int *jb, int *ib, double *c, int *jc, int *ic, int *nzmax, int *info);
void mkl_ccsradd(char *trans, int *request, int *sort, int *m, int *n, MKL_Complex8 *a, int *ja, int *ia, MKL_Complex8 *beta, MKL_Complex8 *b, int *jb, int *ib, MKL_Complex8 *c, int *jc, int *ic, int *nzmax, int *info);
void mkl_zcsradd(char *trans, int *request, int *sort, int *m, int *n, MKL_Complex16 *a, int *ja, int *ia, MKL_Complex16 *beta, MKL_Complex16 *b, int *jb, int *ib, MKL_Complex16 *c, int *jc, int *ic, int *nzmax, int *info);

mkl_?csrmultcsr
Computes product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing.

Syntax

Fortran:
call mkl_scsrmultcsr(trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info)
call mkl_dcsrmultcsr(trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info)
call mkl_ccsrmultcsr(trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info)
call mkl_zcsrmultcsr(trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info)

C:
mkl_scsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info);
mkl_dcsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info);
mkl_ccsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info);
mkl_zcsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?csrmultcsr routine performs a matrix-matrix operation defined as
C := op(A)*B
where:
A, B, C are the sparse matrices in the CSR format
(3-array variation);
op(A) is one of op(A) = A, or op(A) = A', or op(A) = conjg(A').

You can use the parameter sort to control whether the routine reorders non-zero entries in the input and output sparse matrices. The purpose of reordering is to rearrange non-zero entries in a compressed sparse row matrix so that column indices in the compressed sparse representation are sorted in the increasing order for each row.

The following table shows correspondence between the value of the parameter sort and the type of reordering performed by this routine for each sparse matrix involved:

Value of the       Reordering of A       Reordering of B       Reordering of C
parameter sort     (arrays a, ja, ia)    (arrays b, jb, ib)    (arrays c, jc, ic)
1                  yes                   no                    yes
2                  no                    yes                   yes
3                  yes                   yes                   yes
4                  yes                   no                    no
5                  no                    yes                   no
6                  yes                   yes                   no
7                  no                    no                    no
any other value    no                    no                    yes

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

trans
    CHARACTER*1. Specifies the operation.
    If trans = 'N' or 'n', then C := A*B
    If trans = 'T' or 't' or 'C' or 'c', then C := A'*B.
request
    INTEGER.
    If request=0, the routine performs multiplication; the memory for the output arrays ic, jc, c must be allocated beforehand.
    If request=1, the routine computes only the values of the array ic of length m + 1; the memory for this array must be allocated beforehand. On exit, the value ic(m+1) - 1 is the actual number of elements in the arrays c and jc.
    If request=2, the routine has been called previously with the parameter request=1; the output arrays jc and c are allocated in the calling program and are of length ic(m+1) - 1 at least.
sort
    INTEGER.
    Specifies whether the routine performs reordering of non-zero entries in input and/or output sparse matrices (see table above).
m
    INTEGER. Number of rows of the matrix A.
n
    INTEGER. Number of columns of the matrix A.
k
    INTEGER. Number of columns of the matrix B.
a
    REAL for mkl_scsrmultcsr. DOUBLE PRECISION for mkl_dcsrmultcsr. COMPLEX for mkl_ccsrmultcsr. DOUBLE COMPLEX for mkl_zcsrmultcsr.
    Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
ja
    INTEGER. Array containing the column indices for each non-zero element of the matrix A. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ia
    INTEGER. Array of length m + 1. This array contains indices of elements in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I. The value of the last element ia(m + 1) is equal to the number of non-zero elements of the matrix A plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
b
    REAL for mkl_scsrmultcsr. DOUBLE PRECISION for mkl_dcsrmultcsr. COMPLEX for mkl_ccsrmultcsr. DOUBLE COMPLEX for mkl_zcsrmultcsr.
    Array containing non-zero elements of the matrix B. Its length is equal to the number of non-zero elements in the matrix B. Refer to values array description in Sparse Matrix Storage Formats for more details.
jb
    INTEGER. Array containing the column indices for each non-zero element of the matrix B. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array b. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ib
    INTEGER.
    Array of length n + 1 when trans = 'N' or 'n', or m + 1 otherwise. This array contains indices of elements in the array b, such that ib(I) is the index in the array b of the first non-zero element from the row I. The value of the last element ib(n + 1) or ib(m + 1) is equal to the number of non-zero elements of the matrix B plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
nzmax
    INTEGER. The length of the arrays c and jc. This parameter is used only if request=0. The routine stops calculation if the number of elements in the result matrix C exceeds the specified value of nzmax.

Output Parameters

c
    REAL for mkl_scsrmultcsr. DOUBLE PRECISION for mkl_dcsrmultcsr. COMPLEX for mkl_ccsrmultcsr. DOUBLE COMPLEX for mkl_zcsrmultcsr.
    Array containing non-zero elements of the result matrix C. Its length is equal to the number of non-zero elements in the matrix C. Refer to values array description in Sparse Matrix Storage Formats for more details.
jc
    INTEGER. Array containing the column indices for each non-zero element of the matrix C. The length of this array is equal to the length of the array c. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ic
    INTEGER. Array of length m + 1 when trans = 'N' or 'n', or n + 1 otherwise. This array contains indices of elements in the array c, such that ic(I) is the index in the array c of the first non-zero element from the row I. The value of the last element ic(m + 1) or ic(n + 1) is equal to the number of non-zero elements of the matrix C plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
info
    INTEGER.
    If info=0, the execution is successful.
    If info=I>0, the routine stops calculation in the I-th row of the matrix C because the number of elements in C exceeds nzmax.
    If info=-1, the routine calculates only the size of the arrays c and jc and returns this value plus 1 as the last element of the array ic.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_scsrmultcsr( trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info)
    CHARACTER*1 trans
    INTEGER request, sort, m, n, k, nzmax, info
    INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*)
    REAL a(*), b(*), c(*)

SUBROUTINE mkl_dcsrmultcsr( trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info)
    CHARACTER*1 trans
    INTEGER request, sort, m, n, k, nzmax, info
    INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*)
    DOUBLE PRECISION a(*), b(*), c(*)

SUBROUTINE mkl_ccsrmultcsr( trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info)
    CHARACTER*1 trans
    INTEGER request, sort, m, n, k, nzmax, info
    INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*)
    COMPLEX a(*), b(*), c(*)

SUBROUTINE mkl_zcsrmultcsr( trans, request, sort, m, n, k, a, ja, ia, b, jb, ib, c, jc, ic, nzmax, info)
    CHARACTER*1 trans
    INTEGER request, sort, m, n, k, nzmax, info
    INTEGER ja(*), jb(*), jc(*), ia(*), ib(*), ic(*)
    DOUBLE COMPLEX a(*), b(*), c(*)

C:

void mkl_scsrmultcsr(char *trans, int *request, int *sort, int *m, int *n, int *k, float *a, int *ja, int *ia, float *b, int *jb, int *ib, float *c, int *jc, int *ic, int *nzmax, int *info);
void mkl_dcsrmultcsr(char *trans, int *request, int *sort, int *m, int *n, int *k, double *a, int *ja, int *ia, double *b, int *jb, int *ib, double *c, int *jc, int *ic, int *nzmax, int *info);
void mkl_ccsrmultcsr(char *trans, int *request, int *sort, int *m, int *n, int *k, MKL_Complex8 *a, int *ja, int *ia, MKL_Complex8 *b, int *jb, int *ib, MKL_Complex8 *c, int *jc, int *ic, int *nzmax, int *info);
void mkl_zcsrmultcsr(char *trans, int *request, int *sort, int *m, int *n, int *k, MKL_Complex16 *a, int *ja, int *ia, MKL_Complex16 *b, int *jb, int *ib, MKL_Complex16 *c, int *jc, int *ic, int *nzmax, int *info);
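The request=1 and request=2 values split the product into a sizing pass and a fill pass: the first pass produces only ic, and ic(m+1) - 1 then tells the caller how much memory c and jc need. The sketch below illustrates what such a sizing pass computes for the non-transposed case C := A*B. It is a plain-C illustration under stated assumptions, not the MKL implementation: the name csr_mult_count is hypothetical, and the count is structural, so entries that happen to cancel numerically are still counted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not an MKL routine: fills ic (length m + 1, one-based
   values) with the row pointers of C = A*B, where A is m-by-n and B is
   n-by-k, both stored in one-based 3-array CSR.
   Returns 0 on success, -1 if the work array cannot be allocated. */
static int csr_mult_count(int m, int k,
                          const int *ja, const int *ia,
                          const int *jb, const int *ib,
                          int *ic)
{
    assert(m >= 0 && k >= 0);

    /* marker[col] holds the last row of C in which column col was counted;
       calloc zero-fills it, and rows are numbered from 1. */
    int *marker = calloc((size_t)(k > 0 ? k : 1), sizeof *marker);
    if (marker == NULL)
        return -1;

    ic[0] = 1;
    for (int i = 1; i <= m; i++) {
        int count = 0;
        for (int p = ia[i - 1] - 1; p < ia[i] - 1; p++) {
            int row_b = ja[p];                    /* one-based row of B */
            for (int q = ib[row_b - 1] - 1; q < ib[row_b] - 1; q++) {
                int col = jb[q] - 1;              /* zero-based column of C */
                if (marker[col] != i) {           /* first hit in row i */
                    marker[col] = i;
                    count++;
                }
            }
        }
        ic[i] = ic[i - 1] + count;
    }
    free(marker);
    return 0;
}
```

After such a pass, a caller would allocate c and jc with ic[m] - 1 elements (zero-based C indexing of the one-based ic(m+1)) and then run the fill pass, which corresponds to calling the MKL routine with request=2.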
mkl_?csrmultd
Computes product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing. The result is stored in the dense matrix.

Syntax

Fortran:
call mkl_scsrmultd(trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc)
call mkl_dcsrmultd(trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc)
call mkl_ccsrmultd(trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc)
call mkl_zcsrmultd(trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc)

C:
mkl_scsrmultd(&trans, &m, &n, &k, a, ja, ia, b, jb, ib, c, &ldc);
mkl_dcsrmultd(&trans, &m, &n, &k, a, ja, ia, b, jb, ib, c, &ldc);
mkl_ccsrmultd(&trans, &m, &n, &k, a, ja, ia, b, jb, ib, c, &ldc);
mkl_zcsrmultd(&trans, &m, &n, &k, a, ja, ia, b, jb, ib, c, &ldc);

Include Files
• FORTRAN 77: mkl_spblas.fi
• C: mkl_spblas.h

Description

The mkl_?csrmultd routine performs a matrix-matrix operation defined as
C := op(A)*B
where:
A, B are the sparse matrices in the CSR format (3-array variation), C is a dense matrix;
op(A) is one of op(A) = A, or op(A) = A', or op(A) = conjg(A').

The routine works correctly if and only if the column indices in sparse matrix representations of matrices A and B are arranged in the increasing order for each row. This routine has no sort parameter, so column indices that are not in increasing order must be reordered, together with the corresponding elements, before the call.

NOTE This routine supports only one-based indexing of the input arrays.

Input Parameters

Parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.

trans
    CHARACTER*1. Specifies the operation.
    If trans = 'N' or 'n', then C := A*B
    If trans = 'T' or 't' or 'C' or 'c', then C := A'*B.
m
    INTEGER. Number of rows of the matrix A.
n
    INTEGER. Number of columns of the matrix A.
k
    INTEGER. Number of columns of the matrix B.
a
    REAL for mkl_scsrmultd.
    DOUBLE PRECISION for mkl_dcsrmultd. COMPLEX for mkl_ccsrmultd. DOUBLE COMPLEX for mkl_zcsrmultd.
    Array containing non-zero elements of the matrix A. Its length is equal to the number of non-zero elements in the matrix A. Refer to values array description in Sparse Matrix Storage Formats for more details.
ja
    INTEGER. Array containing the column indices for each non-zero element of the matrix A. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array a. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ia
    INTEGER. Array of length m + 1 when trans = 'N' or 'n', or n + 1 otherwise. This array contains indices of elements in the array a, such that ia(I) is the index in the array a of the first non-zero element from the row I. The value of the last element ia(m + 1) or ia(n + 1) is equal to the number of non-zero elements of the matrix A plus one. Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.
b
    REAL for mkl_scsrmultd. DOUBLE PRECISION for mkl_dcsrmultd. COMPLEX for mkl_ccsrmultd. DOUBLE COMPLEX for mkl_zcsrmultd.
    Array containing non-zero elements of the matrix B. Its length is equal to the number of non-zero elements in the matrix B. Refer to values array description in Sparse Matrix Storage Formats for more details.
jb
    INTEGER. Array containing the column indices for each non-zero element of the matrix B. For each row the column indices must be arranged in the increasing order. The length of this array is equal to the length of the array b. Refer to columns array description in Sparse Matrix Storage Formats for more details.
ib
    INTEGER. Array of length m + 1. This array contains indices of elements in the array b, such that ib(I) is the index in the array b of the first non-zero element from the row I. The value of the last element ib(m + 1) is equal to the number of non-zero elements of the matrix B plus one.
    Refer to rowIndex array description in Sparse Matrix Storage Formats for more details.

Output Parameters

c
    REAL for mkl_scsrmultd. DOUBLE PRECISION for mkl_dcsrmultd. COMPLEX for mkl_ccsrmultd. DOUBLE COMPLEX for mkl_zcsrmultd.
    Array containing non-zero elements of the result matrix C.
ldc
    INTEGER. Specifies the leading dimension of the dense matrix C as declared in the calling (sub)program. Must be at least max(m, 1) when trans = 'N' or 'n', or max(1, n) otherwise.

Interfaces

FORTRAN 77:

SUBROUTINE mkl_scsrmultd( trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc)
    CHARACTER*1 trans
    INTEGER m, n, k, ldc
    INTEGER ja(*), jb(*), ia(*), ib(*)
    REAL a(*), b(*), c(ldc, *)

SUBROUTINE mkl_dcsrmultd( trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc)
    CHARACTER*1 trans
    INTEGER m, n, k, ldc
    INTEGER ja(*), jb(*), ia(*), ib(*)
    DOUBLE PRECISION a(*), b(*), c(ldc, *)

SUBROUTINE mkl_ccsrmultd( trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc)
    CHARACTER*1 trans
    INTEGER m, n, k, ldc
    INTEGER ja(*), jb(*), ia(*), ib(*)
    COMPLEX a(*), b(*), c(ldc, *)

SUBROUTINE mkl_zcsrmultd( trans, m, n, k, a, ja, ia, b, jb, ib, c, ldc)
    CHARACTER*1 trans
    INTEGER m, n, k, ldc
    INTEGER ja(*), jb(*), ia(*), ib(*)
    DOUBLE COMPLEX a(*), b(*), c(ldc, *)

C:

void mkl_scsrmultd(char *trans, int *m, int *n, int *k, float *a, int *ja, int *ia, float *b, int *jb, int *ib, float *c, int *ldc);
void mkl_dcsrmultd(char *trans, int *m, int *n, int *k, double *a, int *ja, int *ia, double *b, int *jb, int *ib, double *c, int *ldc);
void mkl_ccsrmultd(char *trans, int *m, int *n, int *k, MKL_Complex8 *a, int *ja, int *ia, MKL_Complex8 *b, int *jb, int *ib, MKL_Complex8 *c, int *ldc);
void mkl_zcsrmultd(char *trans, int *m, int *n, int *k, MKL_Complex16 *a, int *ja, int *ia, MKL_Complex16 *b, int *jb, int *ib, MKL_Complex16 *c, int *ldc);

BLAS-like Extensions

Intel MKL provides C and Fortran routines to extend the functionality of the BLAS routines.
These include routines to compute vector products, matrix-vector products, and matrix-matrix products.

Intel MKL also provides routines to perform certain data manipulation, including matrix in-place and out-of-place transposition operations combined with simple matrix arithmetic operations. Transposition operations are Copy As Is, Conjugate transpose, Transpose, and Conjugate. Each routine can also scale the data during the transposition operation through alpha and/or beta parameters. Each routine supports both row-major and column-major ordering.

Table “BLAS-like Extensions” lists these routines.

The ? symbol in the routine short names is a precision prefix that indicates the data type:
s    REAL for Fortran interface, or float for C interface.
d    DOUBLE PRECISION for Fortran interface, or double for C interface.
c    COMPLEX for Fortran interface, or MKL_Complex8 for C interface.
z    DOUBLE COMPLEX for Fortran interface, or MKL_Complex16 for C interface.

BLAS-like Extensions

Routine           Data Types    Description
?axpby            s, d, c, z    Scales two vectors, adds them to one another and stores result in the vector.
?gem2vu           s, d          Two matrix-vector products using a general matrix, real data.
?gem2vc           c, z          Two matrix-vector products using a general matrix, complex data.
?gemm3m           c, z          Computes a scalar-matrix-matrix product using matrix multiplications and adds the result to a scalar-matrix product.
mkl_?imatcopy     s, d, c, z    Performs scaling and in-place transposition/copying of matrices.
mkl_?omatcopy     s, d, c, z    Performs scaling and out-of-place transposition/copying of matrices.
mkl_?omatcopy2    s, d, c, z    Performs two-strided scaling and out-of-place transposition/copying of matrices.
mkl_?omatadd      s, d, c, z    Performs scaling and sum of two matrices including their out-of-place transposition/copying.

?axpby
Scales two vectors, adds them to one another and stores result in the vector.
Syntax

Fortran 77:
call saxpby(n, a, x, incx, b, y, incy)
call daxpby(n, a, x, incx, b, y, incy)
call caxpby(n, a, x, incx, b, y, incy)
call zaxpby(n, a, x, incx, b, y, incy)

Fortran 95:
call axpby(x, y [,a] [,b])

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?axpby routines perform a vector-vector operation defined as
y := a*x + b*y
where:
a and b are scalars
x and y are vectors each with n elements.

Input Parameters

n
    INTEGER. Specifies the number of elements in vectors x and y.
a
    REAL for saxpby
    DOUBLE PRECISION for daxpby
    COMPLEX for caxpby
    DOUBLE COMPLEX for zaxpby
    Specifies the scalar a.
x
    REAL for saxpby
    DOUBLE PRECISION for daxpby
    COMPLEX for caxpby
    DOUBLE COMPLEX for zaxpby
    Array, DIMENSION at least (1 + (n-1)*abs(incx)).
incx
    INTEGER. Specifies the increment for the elements of x.
b
    REAL for saxpby
    DOUBLE PRECISION for daxpby
    COMPLEX for caxpby
    DOUBLE COMPLEX for zaxpby
    Specifies the scalar b.
y
    REAL for saxpby
    DOUBLE PRECISION for daxpby
    COMPLEX for caxpby
    DOUBLE COMPLEX for zaxpby
    Array, DIMENSION at least (1 + (n-1)*abs(incy)).
incy
    INTEGER. Specifies the increment for the elements of y.

Output Parameters

y
    Contains the updated vector y.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine axpby interface are the following:

x    Holds the array of size n.
y    Holds the array of size n.
a    The default value is 1.
b    The default value is 1.
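The operation y := a*x + b*y with strided vectors can be written out as a short reference loop. The sketch below is an illustration for the double-precision case only; the name ref_daxpby is hypothetical and is not part of MKL. It assumes that a negative increment walks the vector backwards from its last element, which is the reference BLAS convention; the text above does not state how negative increments are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference loop for y := a*x + b*y, double precision.
   Assumption: for a negative increment the first logical element is stored
   at offset (n-1)*|inc|, as in the reference BLAS. */
static void ref_daxpby(int n, double a, const double *x, int incx,
                       double b, double *y, int incy)
{
    if (n <= 0)
        return;
    assert(x != NULL && y != NULL);

    int ix = incx < 0 ? (1 - n) * incx : 0;   /* starting offset in x */
    int iy = incy < 0 ? (1 - n) * incy : 0;   /* starting offset in y */

    for (int i = 0; i < n; i++) {
        y[iy] = a * x[ix] + b * y[iy];
        ix += incx;
        iy += incy;
    }
}
```

With unit increments this reduces to an element-by-element update; with incx = 2, for example, only every second element of x takes part, which is why x must then hold at least 1 + (n-1)*2 elements.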
?gem2vu
Computes two matrix-vector products using a general matrix (real data)

Syntax

Fortran 77:
call sgem2vu(m, n, alpha, a, lda, x1, incx1, x2, incx2, beta, y1, incy1, y2, incy2)
call dgem2vu(m, n, alpha, a, lda, x1, incx1, x2, incx2, beta, y1, incy1, y2, incy2)

Fortran 95:
call gem2vu(a, x1, x2, y1, y2 [,alpha][,beta] )

Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h

Description

The ?gem2vu routines perform two matrix-vector operations defined as
y1 := alpha*A*x1 + beta*y1,
and
y2 := alpha*A'*x2 + beta*y2,
where:
alpha and beta are scalars,
x1, x2, y1, and y2 are vectors,
A is an m-by-n matrix.

Input Parameters

m
    INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n
    INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha
    REAL for sgem2vu
    DOUBLE PRECISION for dgem2vu
    Specifies the scalar alpha.
a
    REAL for sgem2vu
    DOUBLE PRECISION for dgem2vu
    Array, DIMENSION (lda, n). Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda
    INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).
x1
    REAL for sgem2vu
    DOUBLE PRECISION for dgem2vu
    Array, DIMENSION at least (1+(n-1)*abs(incx1)). Before entry, the incremented array x1 must contain the vector x1.
incx1
    INTEGER. Specifies the increment for the elements of x1. The value of incx1 must not be zero.
x2
    REAL for sgem2vu
    DOUBLE PRECISION for dgem2vu
    Array, DIMENSION at least (1+(m-1)*abs(incx2)). Before entry, the incremented array x2 must contain the vector x2.
incx2
    INTEGER. Specifies the increment for the elements of x2. The value of incx2 must not be zero.
beta
    REAL for sgem2vu
    DOUBLE PRECISION for dgem2vu
    Specifies the scalar beta. When beta is set to zero, then y1 and y2 need not be set on input.
y1 REAL for sgem2vu, DOUBLE PRECISION for dgem2vu. Array, DIMENSION at least (1+(m-1)*abs(incy1)). Before entry with nonzero beta, the incremented array y1 must contain the vector y1.
incy1 INTEGER. Specifies the increment for the elements of y1. The value of incy1 must not be zero.
y2 REAL for sgem2vu, DOUBLE PRECISION for dgem2vu. Array, DIMENSION at least (1+(n-1)*abs(incy2)). Before entry with nonzero beta, the incremented array y2 must contain the vector y2.
incy2 INTEGER. Specifies the increment for the elements of y2. The value of incy2 must not be zero.
Output Parameters
y1 Updated vector y1.
y2 Updated vector y2.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gem2vu interface are the following:
a Holds the matrix A of size (m,n).
x1 Holds the vector with the number of elements rx1 where rx1 = n.
x2 Holds the vector with the number of elements rx2 where rx2 = m.
y1 Holds the vector with the number of elements ry1 where ry1 = m.
y2 Holds the vector with the number of elements ry2 where ry2 = n.
alpha The default value is 1.
beta The default value is 0.
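The two products defined for ?gem2vu can be sketched in plain C as shown below. This is an illustrative reference only: the name ref_dgem2vu is hypothetical, the sketch assumes column-major storage and unit increments, and it says nothing about how the library actually organizes the computation. It does show why combining the two operations can be attractive: each column of A is read once and used for both y1 and y2.

```c
#include <stddef.h>

/* Reference sketch (hypothetical name, not an Intel MKL entry point):
 *   y1 := alpha*A*x1  + beta*y1   (y1 has m elements, x1 has n)
 *   y2 := alpha*A'*x2 + beta*y2   (y2 has n elements, x2 has m)
 * Assumptions: A is column-major with leading dimension lda >= m,
 * and all vector increments are 1. */
static void ref_dgem2vu(int m, int n, double alpha, const double *a, int lda,
                        const double *x1, const double *x2, double beta,
                        double *y1, double *y2)
{
    /* When beta is zero, y1 and y2 need not be set on input. */
    for (int i = 0; i < m; ++i) y1[i] = (beta == 0.0) ? 0.0 : beta * y1[i];
    for (int j = 0; j < n; ++j) y2[j] = (beta == 0.0) ? 0.0 : beta * y2[j];

    for (int j = 0; j < n; ++j) {
        const double *col = a + (size_t)j * lda;  /* column j of A */
        double dot = 0.0;
        for (int i = 0; i < m; ++i) {
            y1[i] += alpha * col[i] * x1[j];      /* contribution to A*x1 */
            dot += col[i] * x2[i];                /* element j of A'*x2 */
        }
        y2[j] += alpha * dot;
    }
}
```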
?gem2vc
Computes two matrix-vector products using a general matrix (complex data)
Syntax
Fortran 77:
call cgem2vc(m, n, alpha, a, lda, x1, incx1, x2, incx2, beta, y1, incy1, y2, incy2)
call zgem2vc(m, n, alpha, a, lda, x1, incx1, x2, incx2, beta, y1, incy1, y2, incy2)
Fortran 95:
call gem2vc(a, x1, x2, y1, y2 [,alpha][,beta] )
Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h
Description
The ?gem2vc routines perform two matrix-vector operations defined as
y1 := alpha*A*x1 + beta*y1,
and
y2 := alpha*conjg(A')*x2 + beta*y2,
where:
alpha and beta are scalars,
x1, x2, y1, and y2 are vectors,
A is an m-by-n matrix.
Input Parameters
m INTEGER. Specifies the number of rows of the matrix A. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of the matrix A. The value of n must be at least zero.
alpha COMPLEX for cgem2vc, DOUBLE COMPLEX for zgem2vc. Specifies the scalar alpha.
a COMPLEX for cgem2vc, DOUBLE COMPLEX for zgem2vc. Array, DIMENSION (lda, n). Before entry, the leading m-by-n part of the array a must contain the matrix of coefficients.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. The value of lda must be at least max(1, m).
x1 COMPLEX for cgem2vc, DOUBLE COMPLEX for zgem2vc. Array, DIMENSION at least (1+(n-1)*abs(incx1)). Before entry, the incremented array x1 must contain the vector x1.
incx1 INTEGER. Specifies the increment for the elements of x1. The value of incx1 must not be zero.
x2 COMPLEX for cgem2vc, DOUBLE COMPLEX for zgem2vc. Array, DIMENSION at least (1+(m-1)*abs(incx2)). Before entry, the incremented array x2 must contain the vector x2.
incx2 INTEGER. Specifies the increment for the elements of x2. The value of incx2 must not be zero.
beta COMPLEX for cgem2vc, DOUBLE COMPLEX for zgem2vc. Specifies the scalar beta. When beta is set to zero, then y1 and y2 need not be set on input.
y1 COMPLEX for cgem2vc, DOUBLE COMPLEX for zgem2vc. Array, DIMENSION at least (1+(m-1)*abs(incy1)). Before entry with nonzero beta, the incremented array y1 must contain the vector y1.
incy1 INTEGER. Specifies the increment for the elements of y1. The value of incy1 must not be zero.
y2 COMPLEX for cgem2vc, DOUBLE COMPLEX for zgem2vc. Array, DIMENSION at least (1+(n-1)*abs(incy2)). Before entry with nonzero beta, the incremented array y2 must contain the vector y2.
incy2 INTEGER. Specifies the increment for the elements of y2. The value of incy2 must not be zero.
Output Parameters
y1 Updated vector y1.
y2 Updated vector y2.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gem2vc interface are the following:
a Holds the matrix A of size (m,n).
x1 Holds the vector with the number of elements rx1 where rx1 = n.
x2 Holds the vector with the number of elements rx2 where rx2 = m.
y1 Holds the vector with the number of elements ry1 where ry1 = m.
y2 Holds the vector with the number of elements ry2 where ry2 = n.
alpha The default value is 1.
beta The default value is 0.
?gemm3m
Computes a scalar-matrix-matrix product using matrix multiplications and adds the result to a scalar-matrix product.
Syntax
Fortran 77:
call cgemm3m(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
call zgemm3m(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
Fortran 95:
call gemm3m(a, b, c [,transa][,transb] [,alpha][,beta])
Include Files
• FORTRAN 77: mkl_blas.fi
• Fortran 95: blas.f90
• C: mkl_blas.h
Description
The ?gemm3m routines perform a matrix-matrix operation with general complex matrices.
These routines are similar to the ?gemm routines, but they use fewer real matrix multiplications (see Application Notes below).
The operation is defined as
C := alpha*op(A)*op(B) + beta*C,
where:
op(x) is one of op(x) = x, or op(x) = x', or op(x) = conjg(x'),
alpha and beta are scalars,
A, B and C are matrices:
op(A) is an m-by-k matrix,
op(B) is a k-by-n matrix,
C is an m-by-n matrix.
Input Parameters
transa CHARACTER*1. Specifies the form of op(A) used in the matrix multiplication:
if transa = 'N' or 'n', then op(A) = A;
if transa = 'T' or 't', then op(A) = A';
if transa = 'C' or 'c', then op(A) = conjg(A').
transb CHARACTER*1. Specifies the form of op(B) used in the matrix multiplication:
if transb = 'N' or 'n', then op(B) = B;
if transb = 'T' or 't', then op(B) = B';
if transb = 'C' or 'c', then op(B) = conjg(B').
m INTEGER. Specifies the number of rows of the matrix op(A) and of the matrix C. The value of m must be at least zero.
n INTEGER. Specifies the number of columns of the matrix op(B) and the number of columns of the matrix C. The value of n must be at least zero.
k INTEGER. Specifies the number of columns of the matrix op(A) and the number of rows of the matrix op(B). The value of k must be at least zero.
alpha COMPLEX for cgemm3m, DOUBLE COMPLEX for zgemm3m. Specifies the scalar alpha.
a COMPLEX for cgemm3m, DOUBLE COMPLEX for zgemm3m. Array, DIMENSION (lda, ka), where ka is k when transa = 'N' or 'n', and is m otherwise. Before entry with transa = 'N' or 'n', the leading m-by-k part of the array a must contain the matrix A, otherwise the leading k-by-m part of the array a must contain the matrix A.
lda INTEGER. Specifies the leading dimension of a as declared in the calling (sub)program. When transa = 'N' or 'n', then lda must be at least max(1, m), otherwise lda must be at least max(1, k).
b COMPLEX for cgemm3m, DOUBLE COMPLEX for zgemm3m. Array, DIMENSION (ldb, kb), where kb is n when transb = 'N' or 'n', and is k otherwise.
Before entry with transb = 'N' or 'n', the leading k-by-n part of the array b must contain the matrix B, otherwise the leading n-by-k part of the array b must contain the matrix B.
ldb INTEGER. Specifies the leading dimension of b as declared in the calling (sub)program. When transb = 'N' or 'n', then ldb must be at least max(1, k), otherwise ldb must be at least max(1, n).
beta COMPLEX for cgemm3m, DOUBLE COMPLEX for zgemm3m. Specifies the scalar beta. When beta is equal to zero, then c need not be set on input.
c COMPLEX for cgemm3m, DOUBLE COMPLEX for zgemm3m. Array, DIMENSION (ldc, n). Before entry, the leading m-by-n part of the array c must contain the matrix C, except when beta is equal to zero, in which case c need not be set on entry.
ldc INTEGER. Specifies the leading dimension of c as declared in the calling (sub)program. The value of ldc must be at least max(1, m).
Output Parameters
c Overwritten by the m-by-n matrix (alpha*op(A)*op(B) + beta*C).
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gemm3m interface are the following:
a Holds the matrix A of size (ma,ka) where ka = k if transa = 'N', ka = m otherwise, ma = m if transa = 'N', ma = k otherwise.
b Holds the matrix B of size (mb,kb) where kb = n if transb = 'N', kb = k otherwise, mb = k if transb = 'N', mb = n otherwise.
c Holds the matrix C of size (m,n).
transa Must be 'N', 'C', or 'T'. The default value is 'N'.
transb Must be 'N', 'C', or 'T'. The default value is 'N'.
alpha The default value is 1.
beta The default value is 1.
Application Notes
These routines perform the complex multiplication by forming the real and imaginary parts of the input matrices.
This makes it possible to use three real matrix multiplications and five real matrix additions instead of the conventional four real matrix multiplications and two real matrix additions. One standard arrangement with these operation counts, not necessarily the exact scheme used by the library, is T1 = A1*B1, T2 = A2*B2, T3 = (A1+A2)*(B1+B2), C1 = T1 - T2, C2 = T3 - T1 - T2. Using three real matrix multiplications instead of four gives a 25% reduction in the time spent in matrix multiplication, which can result in significant savings in computing time for large matrices.
If the errors in the floating point calculations satisfy the following conditions:
$fl(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1+\delta), \quad |\delta| \le u, \quad \mathrm{op} = \times, /,$
$fl(x \pm y) = x(1+\alpha) \pm y(1+\beta), \quad |\alpha|, |\beta| \le u,$
then for an n-by-n matrix $\hat{C} = fl(C_1 + iC_2) = fl((A_1 + iA_2)(B_1 + iB_2)) = \hat{C}_1 + i\hat{C}_2$ the following estimates hold:
$\|\hat{C}_1 - C_1\| \le 2(n+1)\,u\,\|A\|_\infty \|B\|_\infty + O(u^2),$
$\|\hat{C}_2 - C_2\| \le 4(n+4)\,u\,\|A\|_\infty \|B\|_\infty + O(u^2),$
where $\|A\|_\infty = \max(\|A_1\|_\infty, \|A_2\|_\infty)$ and $\|B\|_\infty = \max(\|B_1\|_\infty, \|B_2\|_\infty)$,
and hence the matrix multiplications are stable.
mkl_?imatcopy
Performs scaling and in-place transposition/copying of matrices.
Syntax
Fortran:
call mkl_simatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda)
call mkl_dimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda)
call mkl_cimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda)
call mkl_zimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda)
C:
mkl_simatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda);
mkl_dimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda);
mkl_cimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda);
mkl_zimatcopy(ordering, trans, rows, cols, alpha, a, src_lda, dst_lda);
Include Files
• FORTRAN 77: mkl_trans.fi
• C: mkl_trans.h
Description
The mkl_?imatcopy routine performs scaling and in-place transposition/copying of matrices. A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The operation is defined as follows:
A := alpha*op(A).
The routine parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
Note that different arrays should not overlap.
Input Parameters
ordering CHARACTER*1. Ordering of the matrix storage.
If ordering = 'R' or 'r', the ordering is row-major.
If ordering = 'C' or 'c', the ordering is column-major.
trans CHARACTER*1. Parameter that specifies the operation type.
If trans = 'N' or 'n', op(A)=A and the matrix A is assumed unchanged on input.
If trans = 'T' or 't', it is assumed that A should be transposed.
If trans = 'C' or 'c', it is assumed that A should be conjugate transposed.
If trans = 'R' or 'r', it is assumed that A should be only conjugated.
If the data is real, then trans = 'R' is the same as trans = 'N', and trans = 'C' is the same as trans = 'T'.
rows INTEGER. The number of matrix rows.
cols INTEGER. The number of matrix columns.
a REAL for mkl_simatcopy. DOUBLE PRECISION for mkl_dimatcopy. COMPLEX for mkl_cimatcopy. DOUBLE COMPLEX for mkl_zimatcopy. Array, DIMENSION a(src_lda,*).
alpha REAL for mkl_simatcopy. DOUBLE PRECISION for mkl_dimatcopy. COMPLEX for mkl_cimatcopy. DOUBLE COMPLEX for mkl_zimatcopy. This parameter scales the input matrix by alpha.
src_lda INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix; measured in the number of elements. This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
dst_lda INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_lda on output, consider the following guideline:
If ordering = 'C' or 'c', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)
If ordering = 'R' or 'r', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)
Output Parameters
a REAL for mkl_simatcopy. DOUBLE PRECISION for mkl_dimatcopy. COMPLEX for mkl_cimatcopy. DOUBLE COMPLEX for mkl_zimatcopy. Array, DIMENSION at least m. Contains the matrix A.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_simatcopy ( ordering, trans, rows, cols, alpha, a, src_lda, dst_lda )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_lda, dst_lda
REAL a(*), alpha
SUBROUTINE mkl_dimatcopy ( ordering, trans, rows, cols, alpha, a, src_lda, dst_lda )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_lda, dst_lda
DOUBLE PRECISION a(*), alpha
SUBROUTINE mkl_cimatcopy ( ordering, trans, rows, cols, alpha, a, src_lda, dst_lda )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_lda, dst_lda
COMPLEX a(*), alpha
SUBROUTINE mkl_zimatcopy ( ordering, trans, rows, cols, alpha, a, src_lda, dst_lda )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_lda, dst_lda
DOUBLE COMPLEX a(*), alpha
C:
void mkl_simatcopy(char ordering, char trans, size_t rows, size_t cols, float *alpha, float *a, size_t src_lda, size_t dst_lda);
void mkl_dimatcopy(char ordering, char trans, size_t rows, size_t cols, double *alpha, double *a, size_t src_lda, size_t dst_lda);
void mkl_cimatcopy(char ordering, char trans, size_t rows, size_t cols, MKL_Complex8 *alpha, MKL_Complex8 *a, size_t src_lda, size_t dst_lda);
void mkl_zimatcopy(char ordering, char trans, size_t rows, size_t cols, MKL_Complex16 *alpha, MKL_Complex16 *a, size_t src_lda, size_t dst_lda);
mkl_?omatcopy
Performs scaling and out-of-place transposition/copying of matrices.
Syntax
Fortran:
call mkl_somatcopy(ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld)
call mkl_domatcopy(ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld)
call mkl_comatcopy(ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld)
call mkl_zomatcopy(ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld)
C:
mkl_somatcopy(ordering, trans, rows, cols, alpha, SRC, src_stride, DST, dst_stride);
mkl_domatcopy(ordering, trans, rows, cols, alpha, SRC, src_stride, DST, dst_stride);
mkl_comatcopy(ordering, trans, rows, cols, alpha, SRC, src_stride, DST, dst_stride);
mkl_zomatcopy(ordering, trans, rows, cols, alpha, SRC, src_stride, DST, dst_stride);
Include Files
• FORTRAN 77: mkl_trans.fi
• C: mkl_trans.h
Description
The mkl_?omatcopy routine performs scaling and out-of-place transposition/copying of matrices. A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The operation is defined as follows:
B := alpha*op(A)
The routine parameter descriptions are common for all implemented interfaces with the exception of data types that mostly refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
Note that different arrays should not overlap.
Input Parameters
ordering CHARACTER*1. Ordering of the matrix storage.
If ordering = 'R' or 'r', the ordering is row-major.
If ordering = 'C' or 'c', the ordering is column-major.
trans CHARACTER*1. Parameter that specifies the operation type.
If trans = 'N' or 'n', op(A)=A and the matrix A is assumed unchanged on input.
If trans = 'T' or 't', it is assumed that A should be transposed.
If trans = 'C' or 'c', it is assumed that A should be conjugate transposed.
If trans = 'R' or 'r', it is assumed that A should be only conjugated.
If the data is real, then trans = 'R' is the same as trans = 'N', and trans = 'C' is the same as trans = 'T'.
rows INTEGER. The number of matrix rows.
cols INTEGER. The number of matrix columns.
alpha REAL for mkl_somatcopy. DOUBLE PRECISION for mkl_domatcopy. COMPLEX for mkl_comatcopy. DOUBLE COMPLEX for mkl_zomatcopy. This parameter scales the input matrix by alpha.
src REAL for mkl_somatcopy. DOUBLE PRECISION for mkl_domatcopy. COMPLEX for mkl_comatcopy. DOUBLE COMPLEX for mkl_zomatcopy. Array, DIMENSION src(src_ld,*).
src_ld INTEGER. (Fortran interface). Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix; measured in the number of elements. This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
src_stride INTEGER. (C interface). Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix; measured in the number of elements. This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
dst REAL for mkl_somatcopy. DOUBLE PRECISION for mkl_domatcopy. COMPLEX for mkl_comatcopy. DOUBLE COMPLEX for mkl_zomatcopy. Array, DIMENSION dst(dst_ld,*).
dst_ld INTEGER. (Fortran interface). Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_ld on output, consider the following guideline:
If ordering = 'C' or 'c', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)
If ordering = 'R' or 'r', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)
dst_stride INTEGER. (C interface). Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_stride on output, consider the following guideline:
If ordering = 'C' or 'c', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)
If ordering = 'R' or 'r', then
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)
Output Parameters
dst REAL for mkl_somatcopy. DOUBLE PRECISION for mkl_domatcopy. COMPLEX for mkl_comatcopy. DOUBLE COMPLEX for mkl_zomatcopy. Array, DIMENSION at least m. Contains the destination matrix.
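The effect of the ordering, trans, and leading-dimension parameters on B := alpha*op(A) can be sketched in plain C for real double data. This is an illustrative reference only, not the library's implementation: the name ref_domatcopy is hypothetical, and the sketch assumes that rows and cols describe the source matrix A and that alpha is passed by value.

```c
#include <stddef.h>

/* Reference sketch of B := alpha*op(A) for real double data.
 * ref_domatcopy is a hypothetical name, not an Intel MKL entry point.
 * Assumptions: rows and cols describe the source A; for real data
 * 'C' behaves like 'T' and 'R' behaves like 'N'. */
static void ref_domatcopy(char ordering, char trans, size_t rows, size_t cols,
                          double alpha, const double *src, size_t src_ld,
                          double *dst, size_t dst_ld)
{
    int row_major = (ordering == 'R' || ordering == 'r');
    int transpose = (trans == 'T' || trans == 't' ||
                     trans == 'C' || trans == 'c');
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            /* position of A(i,j) in the source storage */
            size_t s = row_major ? i * src_ld + j : j * src_ld + i;
            /* destination indices: swapped when transposing */
            size_t di = transpose ? j : i;
            size_t dj = transpose ? i : j;
            size_t d = row_major ? di * dst_ld + dj : dj * dst_ld + di;
            dst[d] = alpha * src[s];
        }
    }
}
```

For example, transposing a row-major 2-by-3 source with src_ld = 3 produces a 3-by-2 destination; a destination leading dimension of 3 leaves one unused padding element at the end of each destination row.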
Interfaces
FORTRAN 77:
SUBROUTINE mkl_somatcopy ( ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_ld, dst_ld
REAL alpha, dst(dst_ld,*), src(src_ld,*)
SUBROUTINE mkl_domatcopy ( ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_ld, dst_ld
DOUBLE PRECISION alpha, dst(dst_ld,*), src(src_ld,*)
SUBROUTINE mkl_comatcopy ( ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_ld, dst_ld
COMPLEX alpha, dst(dst_ld,*), src(src_ld,*)
SUBROUTINE mkl_zomatcopy ( ordering, trans, rows, cols, alpha, src, src_ld, dst, dst_ld )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_ld, dst_ld
DOUBLE COMPLEX alpha, dst(dst_ld,*), src(src_ld,*)
C:
void mkl_somatcopy(char ordering, char trans, size_t rows, size_t cols, float alpha, float *SRC, size_t src_stride, float *DST, size_t dst_stride);
void mkl_domatcopy(char ordering, char trans, size_t rows, size_t cols, double alpha, double *SRC, size_t src_stride, double *DST, size_t dst_stride);
void mkl_comatcopy(char ordering, char trans, size_t rows, size_t cols, MKL_Complex8 alpha, MKL_Complex8 *SRC, size_t src_stride, MKL_Complex8 *DST, size_t dst_stride);
void mkl_zomatcopy(char ordering, char trans, size_t rows, size_t cols, MKL_Complex16 alpha, MKL_Complex16 *SRC, size_t src_stride, MKL_Complex16 *DST, size_t dst_stride);
mkl_?omatcopy2
Performs two-strided scaling and out-of-place transposition/copying of matrices.
Syntax
Fortran:
call mkl_somatcopy2(ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col)
call mkl_domatcopy2(ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col)
call mkl_comatcopy2(ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col)
call mkl_zomatcopy2(ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col)
C:
mkl_somatcopy2(ordering, trans, rows, cols, alpha, SRC, src_row, src_col, DST, dst_row, dst_col);
mkl_domatcopy2(ordering, trans, rows, cols, alpha, SRC, src_row, src_col, DST, dst_row, dst_col);
mkl_comatcopy2(ordering, trans, rows, cols, alpha, SRC, src_row, src_col, DST, dst_row, dst_col);
mkl_zomatcopy2(ordering, trans, rows, cols, alpha, SRC, src_row, src_col, DST, dst_row, dst_col);
Include Files
• FORTRAN 77: mkl_trans.fi
• C: mkl_trans.h
Description
The mkl_?omatcopy2 routine performs two-strided scaling and out-of-place transposition/copying of matrices. A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The operation is defined as follows:
B := alpha*op(A)
Normally, matrices in the BLAS or LAPACK are specified by a single stride index. For instance, in the column-major order, A(2,1) is stored in memory one element away from A(1,1), but A(1,2) is a leading dimension away. The leading dimension in this case is the single stride. If a matrix has two strides, then both A(2,1) and A(1,2) may be an arbitrary distance from A(1,1).
The routine parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types. Data types specific to the different interfaces are described in the section "Interfaces" below.
Note that different arrays should not overlap.
Input Parameters
ordering CHARACTER*1. Ordering of the matrix storage.
If ordering = 'R' or 'r', the ordering is row-major.
If ordering = 'C' or 'c', the ordering is column-major.
trans CHARACTER*1. Parameter that specifies the operation type.
If trans = 'N' or 'n', op(A)=A and the matrix A is assumed unchanged on input.
If trans = 'T' or 't', it is assumed that A should be transposed.
If trans = 'C' or 'c', it is assumed that A should be conjugate transposed.
If trans = 'R' or 'r', it is assumed that A should be only conjugated.
If the data is real, then trans = 'R' is the same as trans = 'N', and trans = 'C' is the same as trans = 'T'.
rows INTEGER. The number of matrix rows.
cols INTEGER. The number of matrix columns.
alpha REAL for mkl_somatcopy2. DOUBLE PRECISION for mkl_domatcopy2. COMPLEX for mkl_comatcopy2. DOUBLE COMPLEX for mkl_zomatcopy2. This parameter scales the input matrix by alpha.
src REAL for mkl_somatcopy2. DOUBLE PRECISION for mkl_domatcopy2. COMPLEX for mkl_comatcopy2. DOUBLE COMPLEX for mkl_zomatcopy2. Array, DIMENSION src(*).
src_row INTEGER. Distance between the first elements in adjacent rows in the source matrix; measured in the number of elements. This parameter must be at least max(1,rows).
src_col INTEGER. Distance between the first elements in adjacent columns in the source matrix; measured in the number of elements. This parameter must be at least max(1,cols).
dst REAL for mkl_somatcopy2. DOUBLE PRECISION for mkl_domatcopy2. COMPLEX for mkl_comatcopy2. DOUBLE COMPLEX for mkl_zomatcopy2. Array, DIMENSION dst(*).
dst_row INTEGER. Distance between the first elements in adjacent rows in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_row on output, consider the following guideline:
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)
dst_col INTEGER.
Distance between the first elements in adjacent columns in the destination matrix; measured in the number of elements.
To determine the minimum value of dst_col on output, consider the following guideline:
• If trans = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If trans = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)
Output Parameters
dst REAL for mkl_somatcopy2. DOUBLE PRECISION for mkl_domatcopy2. COMPLEX for mkl_comatcopy2. DOUBLE COMPLEX for mkl_zomatcopy2. Array, DIMENSION at least m. Contains the destination matrix.
Interfaces
FORTRAN 77:
SUBROUTINE mkl_somatcopy2 ( ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_row, src_col, dst_row, dst_col
REAL alpha, dst(*), src(*)
SUBROUTINE mkl_domatcopy2 ( ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_row, src_col, dst_row, dst_col
DOUBLE PRECISION alpha, dst(*), src(*)
SUBROUTINE mkl_comatcopy2 ( ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_row, src_col, dst_row, dst_col
COMPLEX alpha, dst(*), src(*)
SUBROUTINE mkl_zomatcopy2 ( ordering, trans, rows, cols, alpha, src, src_row, src_col, dst, dst_row, dst_col )
CHARACTER*1 ordering, trans
INTEGER rows, cols, src_row, src_col, dst_row, dst_col
DOUBLE COMPLEX alpha, dst(*), src(*)
C:
void mkl_somatcopy2(char ordering, char trans, size_t rows, size_t cols, float *alpha, float *SRC, size_t src_row, size_t src_col, float *DST, size_t dst_row, size_t dst_col);
void mkl_domatcopy2(char ordering, char trans, size_t rows, size_t cols, double *alpha, double *SRC, size_t src_row, size_t src_col, double *DST, size_t dst_row, size_t dst_col);
void mkl_comatcopy2(char ordering, char trans, size_t rows, size_t cols, MKL_Complex8 *alpha, MKL_Complex8 *SRC, size_t src_row, size_t src_col, MKL_Complex8 *DST, size_t dst_row, size_t dst_col);
void mkl_zomatcopy2(char ordering, char trans, size_t rows, size_t cols, MKL_Complex16 *alpha, MKL_Complex16 *SRC, size_t src_row, size_t src_col, MKL_Complex16 *DST, size_t dst_row, size_t dst_col);
mkl_?omatadd
Performs scaling and sum of two matrices including their out-of-place transposition/copying.
Syntax
Fortran:
call mkl_somatadd(ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc)
call mkl_domatadd(ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc)
call mkl_comatadd(ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc)
call mkl_zomatadd(ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc)
C:
mkl_somatadd(ordering, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
mkl_domatadd(ordering, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
mkl_comatadd(ordering, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
mkl_zomatadd(ordering, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
Include Files
• FORTRAN 77: mkl_trans.fi
• C: mkl_trans.h
Description
The mkl_?omatadd routine performs scaling and summation of two matrices, including their out-of-place transposition/copying. A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The following out-of-place memory movement is done:
C := alpha*op(A) + beta*op(B)
Depending on transa, op(A) is the transpose of A, the conjugate transpose of A, or A left unchanged; op(B) is determined by transb in the same way. If no transposition of the source matrices is required, m is the number of rows and n is the number of columns in the source matrices A and B. In this case, the output matrix C is m-by-n.
The routine parameter descriptions are common for all implemented interfaces with the exception of data types that refer here to the FORTRAN 77 standard types.
Data types specific to the different interfaces are described in the section "Interfaces" below.
Note that different arrays should not overlap.
Input Parameters
ordering CHARACTER*1. Ordering of the matrix storage.
If ordering = 'R' or 'r', the ordering is row-major.
If ordering = 'C' or 'c', the ordering is column-major.
transa CHARACTER*1. Parameter that specifies the operation type on matrix A.
If transa = 'N' or 'n', op(A)=A and the matrix A is assumed unchanged on input.
If transa = 'T' or 't', it is assumed that A should be transposed.
If transa = 'C' or 'c', it is assumed that A should be conjugate transposed.
If transa = 'R' or 'r', it is assumed that A should be only conjugated.
If the data is real, then transa = 'R' is the same as transa = 'N', and transa = 'C' is the same as transa = 'T'.
transb CHARACTER*1. Parameter that specifies the operation type on matrix B.
If transb = 'N' or 'n', op(B)=B and the matrix B is assumed unchanged on input.
If transb = 'T' or 't', it is assumed that B should be transposed.
If transb = 'C' or 'c', it is assumed that B should be conjugate transposed.
If transb = 'R' or 'r', it is assumed that B should be only conjugated.
If the data is real, then transb = 'R' is the same as transb = 'N', and transb = 'C' is the same as transb = 'T'.
m INTEGER. The number of matrix rows.
n INTEGER. The number of matrix columns.
alpha REAL for mkl_somatadd. DOUBLE PRECISION for mkl_domatadd. COMPLEX for mkl_comatadd. DOUBLE COMPLEX for mkl_zomatadd. This parameter scales the input matrix by alpha.
a REAL for mkl_somatadd. DOUBLE PRECISION for mkl_domatadd. COMPLEX for mkl_comatadd. DOUBLE COMPLEX for mkl_zomatadd. Array, DIMENSION a(lda,*).
lda INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix A; measured in the number of elements.
This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
beta REAL for mkl_somatadd. DOUBLE PRECISION for mkl_domatadd. COMPLEX for mkl_comatadd. DOUBLE COMPLEX for mkl_zomatadd. This parameter scales the input matrix by beta.
b REAL for mkl_somatadd. DOUBLE PRECISION for mkl_domatadd. COMPLEX for mkl_comatadd. DOUBLE COMPLEX for mkl_zomatadd. Array, DIMENSION b(ldb,*).
ldb INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the source matrix B; measured in the number of elements. This parameter must be at least max(1,rows) if ordering = 'C' or 'c', and max(1,cols) otherwise.
Output Parameters
c REAL for mkl_somatadd. DOUBLE PRECISION for mkl_domatadd. COMPLEX for mkl_comatadd. DOUBLE COMPLEX for mkl_zomatadd. Array, DIMENSION c(ldc,*).
ldc INTEGER. Distance between the first elements in adjacent columns (in the case of the column-major order) or rows (in the case of the row-major order) in the destination matrix C; measured in the number of elements.
To determine the minimum value of ldc, consider the following guideline:

If ordering = 'C' or 'c', then
• If transa or transb = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,rows)
• If transa or transb = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,cols)

If ordering = 'R' or 'r', then
• If transa or transb = 'T' or 't' or 'C' or 'c', this parameter must be at least max(1,cols)
• If transa or transb = 'N' or 'n' or 'R' or 'r', this parameter must be at least max(1,rows)

Interfaces

FORTRAN 77:

SUBROUTINE mkl_somatadd ( ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc )
    CHARACTER*1 ordering, transa, transb
    INTEGER m, n, lda, ldb, ldc
    REAL alpha, beta
    REAL a(lda,*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_domatadd ( ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc )
    CHARACTER*1 ordering, transa, transb
    INTEGER m, n, lda, ldb, ldc
    DOUBLE PRECISION alpha, beta
    DOUBLE PRECISION a(lda,*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_comatadd ( ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc )
    CHARACTER*1 ordering, transa, transb
    INTEGER m, n, lda, ldb, ldc
    COMPLEX alpha, beta
    COMPLEX a(lda,*), b(ldb,*), c(ldc,*)

SUBROUTINE mkl_zomatadd ( ordering, transa, transb, m, n, alpha, a, lda, beta, b, ldb, c, ldc )
    CHARACTER*1 ordering, transa, transb
    INTEGER m, n, lda, ldb, ldc
    DOUBLE COMPLEX alpha, beta
    DOUBLE COMPLEX a(lda,*), b(ldb,*), c(ldc,*)

C:

void mkl_somatadd(char ordering, char transa, char transb, size_t m, size_t n, float *alpha, float *A, size_t lda, float *beta, float *B, size_t ldb, float *C, size_t ldc);
void mkl_domatadd(char ordering, char transa, char transb, size_t m, size_t n, double *alpha, double *A, size_t lda, double *beta, double *B, size_t ldb, double *C, size_t ldc);
void mkl_comatadd(char ordering, char transa, char transb, size_t m, size_t n, MKL_Complex8 *alpha, MKL_Complex8 *A, size_t lda, MKL_Complex8 *beta, MKL_Complex8 *B, size_t ldb, MKL_Complex8 *C, size_t ldc);
void mkl_zomatadd(char ordering, char transa, char transb, size_t m, size_t n, MKL_Complex16 *alpha, MKL_Complex16 *A, size_t lda, MKL_Complex16 *beta, MKL_Complex16 *B, size_t ldb, MKL_Complex16 *C, size_t ldc);

LAPACK Routines: Linear Equations 3

This chapter describes the Intel® Math Kernel Library implementation of routines from the LAPACK package that are used for solving systems of linear equations and performing a number of related computational tasks. The library includes LAPACK routines for both real and complex data. Routines are supported for systems of equations with the following types of matrices:
• general
• banded
• symmetric or Hermitian positive-definite (full, packed, and rectangular full packed (RFP) storage)
• symmetric or Hermitian positive-definite banded
• symmetric or Hermitian indefinite (both full and packed storage)
• symmetric or Hermitian indefinite banded
• triangular (full, packed, and RFP storage)
• triangular banded
• tridiagonal
• diagonally dominant tridiagonal.

For each of the above matrix types, the library includes routines for performing the following computations:
– factoring the matrix (except for triangular matrices)
– equilibrating the matrix (except for RFP matrices)
– solving a system of linear equations
– estimating the condition number of a matrix (except for RFP matrices)
– refining the solution of linear equations and computing its error bounds (except for RFP matrices)
– inverting the matrix.

To solve a particular problem, you can call two or more computational routines or call a corresponding driver routine that combines several tasks in one call. For example, to solve a system of linear equations with a general matrix, call ?getrf (LU factorization) and then ?getrs (computing the solution). Then, call ?gerfs to refine the solution and get the error bounds. Alternatively, use the driver routine ?gesvx that performs all these tasks in one call.
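To make the factor-then-solve sequence concrete, the following is a minimal, self-contained C sketch of what the ?getrf and ?getrs steps compute for a square matrix in column-major storage. It does not call Intel MKL: the names lu_factor and lu_solve are hypothetical helpers written for illustration only, and the unblocked loops are an assumption about the simplest textbook form of the algorithm, not the library's implementation. The pivot array follows the 1-based convention used by ipiv.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_value(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper (not an MKL routine): in-place LU factorization with
 * partial pivoting of an n-by-n column-major matrix, A = P*L*U.
 * ipiv[k] is the 1-based index of the row interchanged with row k+1.
 * Returns 0 on success, or k+1 for the first exactly zero pivot u(k,k). */
int lu_factor(size_t n, double *a, size_t lda, int *ipiv)
{
    int info = 0;
    assert(lda >= n);
    for (size_t k = 0; k < n; ++k) {
        /* Find the largest entry in column k at or below the diagonal. */
        size_t p = k;
        for (size_t i = k + 1; i < n; ++i)
            if (abs_value(a[i + k * lda]) > abs_value(a[p + k * lda]))
                p = i;
        ipiv[k] = (int)p + 1;
        if (a[p + k * lda] == 0.0) {
            if (info == 0) info = (int)k + 1;
            continue;
        }
        if (p != k) {
            /* Interchange rows k and p across all columns. */
            for (size_t j = 0; j < n; ++j) {
                double t = a[k + j * lda];
                a[k + j * lda] = a[p + j * lda];
                a[p + j * lda] = t;
            }
        }
        /* Store multipliers below the diagonal, update the trailing block. */
        for (size_t i = k + 1; i < n; ++i) {
            a[i + k * lda] /= a[k + k * lda];
            for (size_t j = k + 1; j < n; ++j)
                a[i + j * lda] -= a[i + k * lda] * a[k + j * lda];
        }
    }
    return info;
}

/* Hypothetical helper: solve A*x = b with the factors from lu_factor.
 * b is overwritten by the solution x. */
void lu_solve(size_t n, const double *a, size_t lda, const int *ipiv, double *b)
{
    /* Apply the recorded row interchanges to b. */
    for (size_t k = 0; k < n; ++k) {
        size_t p = (size_t)(ipiv[k] - 1);
        if (p != k) { double t = b[k]; b[k] = b[p]; b[p] = t; }
    }
    /* Forward substitution with the unit lower triangular factor L. */
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < i; ++j)
            b[i] -= a[i + j * lda] * b[j];
    /* Back substitution with the upper triangular factor U. */
    for (size_t i = n; i-- > 0;) {
        for (size_t j = i + 1; j < n; ++j)
            b[i] -= a[i + j * lda] * b[j];
        b[i] /= a[i + i * lda];
    }
}
```

Calling lu_factor once and lu_solve for each right-hand side mirrors the reason the computational routines are split: the factorization is the expensive step and can be reused. A nonzero return value from lu_factor plays the role of info > 0, signalling an exactly singular U.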
WARNING LAPACK routines assume that input matrices do not contain IEEE 754 special values such as INF or NaN values. Using these special values may cause LAPACK to return unexpected results or become unstable.

Starting with release 8.0, Intel MKL supports, along with the FORTRAN 77 interface to LAPACK computational and driver routines, the Fortran 95 interface, which uses simplified routine calls with shorter argument lists. The syntax section of the routine description gives the calling sequence for the Fortran 95 interface, where available, immediately after the FORTRAN 77 calls.

Routine Naming Conventions

To call each routine introduced in this chapter from a FORTRAN 77 program, you can use the LAPACK name. LAPACK names are listed in Table "Computational Routines for Systems of Equations with Real Matrices" and Table "Computational Routines for Systems of Equations with Complex Matrices", and have the structure ?yyzzz or ?yyzz, which is described below.

The initial symbol ? indicates the data type:
s  real, single precision
c  complex, single precision
d  real, double precision
z  complex, double precision

Some routines can have combined character codes, such as ds or zc.
The second and third letters yy indicate the matrix type and storage scheme:
ge  general
gb  general band
gt  general tridiagonal
dt  diagonally dominant tridiagonal
po  symmetric or Hermitian positive-definite
pp  symmetric or Hermitian positive-definite (packed storage)
pf  symmetric or Hermitian positive-definite (RFP storage)
pb  symmetric or Hermitian positive-definite band
pt  symmetric or Hermitian positive-definite tridiagonal
sy  symmetric indefinite
sp  symmetric indefinite (packed storage)
he  Hermitian indefinite
hp  Hermitian indefinite (packed storage)
tr  triangular
tp  triangular (packed storage)
tf  triangular (RFP storage)
tb  triangular band

The last three letters zzz indicate the computation performed:
trf  perform a triangular matrix factorization
trs  solve the linear system with a factored matrix
con  estimate the matrix condition number
rfs  refine the solution and compute error bounds
rfsx  refine the solution and compute error bounds using extra-precise iterative refinement
tri  compute the inverse matrix using the factorization
equ, equb  equilibrate a matrix.

For example, the sgetrf routine performs the triangular factorization of general real matrices in single precision; the corresponding routine for complex matrices is cgetrf.

Driver routine names can end with -sv (meaning a simple driver), with -svx (meaning an expert driver), or with -svxx (meaning an extra-precise iterative refinement expert driver).

The names of the Fortran 95 interfaces to the LAPACK computational and driver routines are the same as the FORTRAN 77 names but without the first letter that indicates the data type. For example, the name of the routine that performs a triangular factorization of general real matrices in Fortran 95 is getrf. Different data types are handled through the definition of a specific internal parameter that refers to a module block with named constants for single and double precision.
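The ?yyzzz structure can be illustrated with a small, self-contained C sketch that splits a routine name into its three parts and looks each part up in tables copied from the lists above. The function decode_lapack_name, the struct, and the table names are hypothetical and exist only for this illustration; they are not part of Intel MKL, and the sketch assumes a single-letter type code, so combined codes such as ds or zc and driver suffixes are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One table row: a code from the naming convention and its meaning. */
struct code_desc { const char *code; const char *text; };

static const struct code_desc type_codes[] = {
    {"s", "real, single precision"},    {"c", "complex, single precision"},
    {"d", "real, double precision"},    {"z", "complex, double precision"}
};

static const struct code_desc matrix_codes[] = {
    {"ge", "general"}, {"gb", "general band"}, {"gt", "general tridiagonal"},
    {"dt", "diagonally dominant tridiagonal"},
    {"po", "symmetric or Hermitian positive-definite"},
    {"pp", "symmetric or Hermitian positive-definite (packed storage)"},
    {"pf", "symmetric or Hermitian positive-definite (RFP storage)"},
    {"pb", "symmetric or Hermitian positive-definite band"},
    {"pt", "symmetric or Hermitian positive-definite tridiagonal"},
    {"sy", "symmetric indefinite"},
    {"sp", "symmetric indefinite (packed storage)"},
    {"he", "Hermitian indefinite"},
    {"hp", "Hermitian indefinite (packed storage)"},
    {"tr", "triangular"}, {"tp", "triangular (packed storage)"},
    {"tf", "triangular (RFP storage)"}, {"tb", "triangular band"}
};

static const struct code_desc task_codes[] = {
    {"trf", "perform a triangular matrix factorization"},
    {"trs", "solve the linear system with a factored matrix"},
    {"con", "estimate the matrix condition number"},
    {"rfs", "refine the solution and compute error bounds"},
    {"rfsx", "refine the solution using extra-precise iterative refinement"},
    {"tri", "compute the inverse matrix using the factorization"},
    {"equ", "equilibrate a matrix"}, {"equb", "equilibrate a matrix"}
};

#define COUNT(t) (sizeof(t) / sizeof((t)[0]))

/* Return the description of code[0..len), or NULL if the code is unknown. */
static const char *lookup(const struct code_desc *table, size_t count,
                          const char *code, size_t len)
{
    for (size_t k = 0; k < count; ++k)
        if (strlen(table[k].code) == len &&
            strncmp(table[k].code, code, len) == 0)
            return table[k].text;
    return NULL;
}

/* Hypothetical helper: split a name of the form ?yyzzz into data type,
 * matrix type, and computation. Returns 0 on success, -1 otherwise. */
int decode_lapack_name(const char *name, const char **type,
                       const char **matrix, const char **task)
{
    size_t len = strlen(name);
    if (len < 5)
        return -1;
    *type = lookup(type_codes, COUNT(type_codes), name, 1);
    *matrix = lookup(matrix_codes, COUNT(matrix_codes), name + 1, 2);
    *task = lookup(task_codes, COUNT(task_codes), name + 3, len - 3);
    return (*type && *matrix && *task) ? 0 : -1;
}
```

For instance, decoding "sgetrf" yields "real, single precision", "general", and "perform a triangular matrix factorization", which matches the example given in the text.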
C Interface Conventions

The C interfaces are implemented for most of the Intel MKL LAPACK driver and computational routines. The arguments of the C interfaces for the Intel MKL LAPACK functions comply with the following rules:
• Scalar input arguments are passed by value.
• Array arguments are passed by reference.
• Array input arguments are declared with the const modifier.
• Function arguments are passed by pointer.
• An integer return value replaces the info output parameter. The return value equal to 0 means the function operation is completed successfully. See also special error codes below.

Matrix Order

Most of the LAPACK C interfaces have an additional parameter matrix_order of type int as their first argument. This parameter specifies whether the two-dimensional arrays are row-major (LAPACK_ROW_MAJOR) or column-major (LAPACK_COL_MAJOR).

In general the leading dimension lda is equal to the number of elements in the major dimension. It is also equal to the distance in elements between two neighboring elements in a line in the minor dimension. If there are no extra elements in a matrix with m rows and n columns, then
• For row-major ordering: the number of elements in a row is n, and row i is stored in memory right after row i-1. Therefore lda is n.
• For column-major ordering: the number of elements in a column is m, and column i is stored in memory right after column i-1. Therefore lda is m.

To refer to a submatrix with dimensions k by l, use the number of elements in the major dimension of the whole matrix (as above) as the leading dimension and k and l in the subroutine's input parameters to describe the size of the submatrix.

Workspace Arrays

The LAPACK C interface omits workspace parameters because workspace is allocated during runtime and released upon completion of the function operation. For some functions, work arrays contain valuable information on exit.
In such cases, the interface contains an additional argument or arguments, namely:
• ?gesvx and ?gbsvx contain rpivot
• ?gesvd contains superb
• ?gejsv and ?gesvj contain istat and stat, respectively.

Function Types

The function types are used in non-symmetric eigenproblem functions only.

typedef lapack_logical (*LAPACK_S_SELECT2) (const float*, const float*);
typedef lapack_logical (*LAPACK_S_SELECT3) (const float*, const float*, const float*);
typedef lapack_logical (*LAPACK_D_SELECT2) (const double*, const double*);
typedef lapack_logical (*LAPACK_D_SELECT3) (const double*, const double*, const double*);
typedef lapack_logical (*LAPACK_C_SELECT1) (const lapack_complex_float*);
typedef lapack_logical (*LAPACK_C_SELECT2) (const lapack_complex_float*, const lapack_complex_float*);
typedef lapack_logical (*LAPACK_Z_SELECT1) (const lapack_complex_double*);
typedef lapack_logical (*LAPACK_Z_SELECT2) (const lapack_complex_double*, const lapack_complex_double*);

Mapping FORTRAN Data Types against C Data Types

FORTRAN Data Types vs. C Data Types

FORTRAN | C
INTEGER | lapack_int
LOGICAL | lapack_logical
REAL | float
DOUBLE PRECISION | double
COMPLEX | lapack_complex_float
COMPLEX*16/DOUBLE COMPLEX | lapack_complex_double
CHARACTER | char

C Type Definitions

#ifndef lapack_int
#define lapack_int MKL_INT
#endif

#ifndef lapack_logical
#define lapack_logical lapack_int
#endif

Complex Type Definitions

Complex type for single precision:

#ifndef lapack_complex_float
#define lapack_complex_float MKL_Complex8
#endif

Complex type for double precision:

#ifndef lapack_complex_double
#define lapack_complex_double MKL_Complex16
#endif

Matrix Order Definitions

#define LAPACK_ROW_MAJOR 101
#define LAPACK_COL_MAJOR 102

See Matrix Order for an explanation of row-major order and column-major order storage.
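The relationship between matrix order, leading dimension, and submatrices described under Matrix Order can be sketched in a few lines of self-contained C. The helpers element_offset and submatrix_sum are hypothetical names introduced only for this illustration, and indices are 0-based as is usual in C. The two macros repeat the values listed in the definitions above and are guarded so they do not clash if the Intel MKL header is also included.

```c
#include <assert.h>
#include <stddef.h>

#ifndef LAPACK_ROW_MAJOR
#define LAPACK_ROW_MAJOR 101
#endif
#ifndef LAPACK_COL_MAJOR
#define LAPACK_COL_MAJOR 102
#endif

/* Hypothetical helper: offset of element (i, j), both 0-based, in a
 * two-dimensional array with leading dimension lda.
 * Row-major: consecutive rows are lda elements apart.
 * Column-major: consecutive columns are lda elements apart. */
size_t element_offset(int matrix_order, size_t i, size_t j, size_t lda)
{
    assert(matrix_order == LAPACK_ROW_MAJOR ||
           matrix_order == LAPACK_COL_MAJOR);
    return (matrix_order == LAPACK_ROW_MAJOR) ? i * lda + j : i + j * lda;
}

/* Hypothetical helper: sum of the k-by-l submatrix whose top-left element
 * is (i0, j0). The leading dimension stays that of the whole matrix;
 * only the sizes k and l describe the submatrix. */
double submatrix_sum(int matrix_order, const double *a, size_t lda,
                     size_t i0, size_t j0, size_t k, size_t l)
{
    double sum = 0.0;
    for (size_t i = 0; i < k; ++i)
        for (size_t j = 0; j < l; ++j)
            sum += a[element_offset(matrix_order, i0 + i, j0 + j, lda)];
    return sum;
}
```

With a 3-by-4 matrix and no extra elements, lda is 4 in row-major order and 3 in column-major order, so element (1, 2) sits at offset 6 in the first case and at offset 7 in the second.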
Error Code Definitions

#define LAPACK_WORK_MEMORY_ERROR -1010 /* Failed to allocate memory for a working array */
#define LAPACK_TRANSPOSE_MEMORY_ERROR -1011 /* Failed to allocate memory for transposed matrix */

If the return value is -i, the i-th parameter has an invalid value.

Function Prototypes

Some Intel MKL functions differ in data types they support and vary in the parameters they take. Each function type has a unique prototype defined. Use this prototype when you call the function from your application program. In most cases, Intel MKL supports four distinct floating-point precisions. Each corresponding prototype looks similar, usually differing only in the data type. To avoid listing all the prototypes in every supported precision, a generic prototype template is provided.

<?> denotes precision and is s, d, c, or z:
• s for real, single precision
• d for real, double precision
• c for complex, single precision
• z for complex, double precision

<datatype> stands for a respective data type: float, double, lapack_complex_float, or lapack_complex_double.

For example, the C prototype template for the ?pptrs function that solves a system of linear equations with a packed Cholesky-factored symmetric (Hermitian) positive-definite matrix looks as follows:

lapack_int LAPACKE_<?>pptrs(int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* ap, <datatype>* b, lapack_int ldb);

To obtain the function name and parameter list that corresponds to a specific precision, replace the <?> symbol with s, d, c, or z and the <datatype> field with the corresponding data type (float, double, lapack_complex_float, or lapack_complex_double respectively). A specific example follows.
To solve a system of linear equations with a packed Cholesky-factored Hermitian positive-definite matrix in single-precision complex arithmetic, use the following:

lapack_int LAPACKE_cpptrs(int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, lapack_complex_float* b, lapack_int ldb);

NOTE For the select parameter, the respective values of the <datatype> field for s, d, c, or z are as follows: LAPACK_S_SELECT3, LAPACK_D_SELECT3, LAPACK_C_SELECT2, and LAPACK_Z_SELECT2.

Fortran 95 Interface Conventions

Intel® MKL implements the Fortran 95 interface to LAPACK through wrappers that call respective FORTRAN 77 routines. This interface uses such Fortran 95 features as assumed-shape arrays and optional arguments to provide simplified calls to LAPACK routines with fewer arguments.

NOTE For LAPACK, Intel MKL offers two types of the Fortran 95 interfaces:
• using mkl_lapack.fi only through the include 'mkl_lapack.fi' statement. Such interfaces allow you to make use of the original LAPACK routines with all their arguments
• using lapack.f90 that includes improved interfaces. This file is used to generate the module files lapack95.mod and f95_precision.mod. The module files mkl95_lapack.mod and mkl95_precision.mod are also generated. See also the section "Fortran 95 interfaces and wrappers to LAPACK and BLAS" of the Intel® MKL User's Guide for details. The module files are used to process the FORTRAN use clauses referencing the LAPACK interface: use lapack95 (or an equivalent use mkl95_lapack) and use f95_precision (or an equivalent use mkl95_precision).

The main conventions for the Fortran 95 interface are as follows:
• The names of arguments used in a Fortran 95 call are typically the same as for the respective generic (FORTRAN 77) interface. In rare cases, formal argument names may be different. For instance, select instead of selctg.
• Input arguments such as array dimensions are not required in Fortran 95 and are skipped from the calling sequence. Array dimensions are reconstructed from the user data that must exactly follow the required array shape. Another type of generic arguments that are skipped in the Fortran 95 interface are arguments that represent workspace arrays (such as work, rwork, and so on). The only exceptions are cases when workspace arrays return significant information on output. An argument can also be skipped if its value is completely defined by the presence or absence of another argument in the calling sequence, and the restored value is the only meaningful value for the skipped argument.
• Some generic arguments are declared as optional in the Fortran 95 interface and may or may not be present in the calling sequence. An argument can be declared optional if it meets one of the following conditions:
– If an argument value is completely defined by the presence or absence of another argument in the calling sequence, it can be declared optional. The difference from the skipped argument in this case is that the optional argument can have some meaningful values that are distinct from the value reconstructed by default. For example, if some argument (like jobz) can take only two values and one of these values directly implies the use of another argument, then the value of jobz can be uniquely reconstructed from the actual presence or absence of this second argument, and jobz can be omitted.
– If an input argument can take only a few possible values, it can be declared as optional. The default value of such an argument is typically set as the first value in the list and all exceptions to this rule are explicitly stated in the routine description.
– If an input argument has a natural default value, it can be declared as optional. The default value of such an optional argument is set to its natural default value.
• Argument info is declared as optional in the Fortran 95 interface.
If it is present in the calling sequence, the value assigned to info is interpreted as follows:
– If this value is more than -1000, its meaning is the same as in the FORTRAN 77 routine.
– If this value is equal to -1000, it means that there is not enough work memory.
– If this value is equal to -1001, incompatible arguments are present in the calling sequence.
– If this value is equal to -i, the i-th parameter (counting parameters in the FORTRAN 77 interface, not the Fortran 95 interface) had an illegal value.
• Optional arguments are given in square brackets in the Fortran 95 call syntax.

The "Fortran 95 Notes" subsection at the end of the topic describing each routine details concrete rules for reconstructing the values of the omitted optional parameters.

Intel® MKL Fortran 95 Interfaces for LAPACK Routines vs. Netlib Implementation

The following list presents general digressions of the Intel MKL LAPACK95 implementation from the Netlib analog:
• The Intel MKL Fortran 95 interfaces are provided for pure procedures.
• Names of interfaces do not contain the LA_ prefix.
• An optional array argument always has the target attribute.
• Functionality of the Intel MKL LAPACK95 wrapper is close to the FORTRAN 77 original implementation in the getrf, gbtrf, and potrf interfaces.
• If the jobz argument value specifies presence or absence of the z argument, then z is always declared as optional and jobz is restored depending on whether z is present or not. It is not always so in the Netlib version (see "Modified Netlib Interfaces" in Appendix E).
• To avoid double error checking, processing of the info argument is limited to checking of the allocated memory and disarranging of optional arguments.
• If an argument that is present in the list of arguments completely defines another argument, the latter is always declared as optional.
You can transform an application that uses the Netlib LAPACK interfaces so that it works with the Intel MKL interfaces, provided that:
a. The application is correct, that is, unambiguous, compiler-independent, and contains no errors.
b. Each routine name denotes only one specific routine. If any routine name in the application coincides with a name of the original Netlib routine (for example, after removing the LA_ prefix) but denotes a routine different from the Netlib original routine, this name should be modified through context name replacement.

You should transform your application in the following cases (see Appendix E for specific differences of individual interfaces):
• When using the Netlib routines that differ from the Intel MKL routines only by the LA_ prefix or in the array attribute target. The only transformation required in this case is context name replacement. See "Interfaces Identical to Netlib" in Appendix E for details.
• When using Netlib routines that differ from the Intel MKL routines by the LA_ prefix, the target array attribute, and the names of formal arguments. In the case of positional passing of arguments, no additional transformation except context name replacement is required. In the case of keyword passing of arguments, in addition to the context name replacement the names of mismatching keywords should also be modified. See "Interfaces with Replaced Argument Names" in Appendix E for details.
• When using the Netlib routines that differ from the respective Intel MKL routines by the LA_ prefix, the target array attribute, sequence of the arguments, arguments missing in Intel MKL but present in Netlib and, vice versa, present in Intel MKL but missing in Netlib. Remove the differences in the sequence and range of the arguments in the process of all the transformations when you use the Netlib routines specified by this bullet and the preceding bullet.
See "Modified Netlib Interfaces" in Appendix E for details.
• When using the getrf, gbtrf, and potrf interfaces, that is, new functionality implemented in Intel MKL but unavailable in the Netlib source. To override the differences, build the desired functionality explicitly with the Intel MKL means or create a new subroutine with the new functionality, using specific MKL interfaces corresponding to LAPACK 77 routines. You can call the LAPACK 77 routines directly but using the new Intel MKL interfaces is preferable. See "Interfaces Absent From Netlib" and "Interfaces of New Functionality" in Appendix E for details. Note that if the transformed application calls getrf, gbtrf or potrf without controlling arguments rcond and norm, just context name replacement is enough in modifying the calls into the Intel MKL interfaces, as described in the first bullet above. The Netlib functionality is preserved in such cases.
• When using the Netlib auxiliary routines. In this case, call a corresponding subroutine directly, using the Intel MKL LAPACK 77 interfaces.

Transform your application as follows:
1. Make sure conditions a. and b. are met.
2. Select Netlib LAPACK 95 calls. For each call, do the following:
• Select the type of digression and do the required transformations.
• Revise results to eliminate unneeded code or data, which may appear after several identical calls.
3. Make sure the transformations are correct and complete.

Matrix Storage Schemes

LAPACK routines use the following matrix storage schemes:
• Full storage: a matrix A is stored in a two-dimensional array a, with the matrix element $a_{ij}$ stored in the array element a(i,j).
• Packed storage scheme allows you to store symmetric, Hermitian, or triangular matrices more compactly: the upper or lower triangle of the matrix is packed by columns in a one-dimensional array.
• Band storage: an m-by-n band matrix with kl sub-diagonals and ku superdiagonals is stored compactly in a two-dimensional array ab with kl+ku+1 rows and n columns. Columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array.
• Rectangular Full Packed (RFP) storage: the upper or lower triangle of the matrix is packed combining the full and packed storage schemes. This combination enables using half of the full storage as packed storage while maintaining efficiency by using Level 3 BLAS/LAPACK kernels as the full storage.

In Chapters 4 and 5, arrays that hold matrices in packed storage have names ending in p; arrays with matrices in band storage have names ending in b; arrays with matrices in the RFP storage have names ending in fp.

For more information on matrix storage schemes, see "Matrix Arguments" in Appendix B.

Mathematical Notation

Descriptions of LAPACK routines use the following notation:

$Ax = b$  A system of linear equations with an n-by-n matrix $A = \{a_{ij}\}$, a right-hand side vector $b = \{b_i\}$, and an unknown vector $x = \{x_i\}$.
$AX = B$  A set of systems with a common matrix A and multiple right-hand sides. The columns of B are individual right-hand sides, and the columns of X are the corresponding solutions.
$|x|$  The vector with elements $|x_i|$ (absolute values of $x_i$).
$|A|$  The matrix with elements $|a_{ij}|$ (absolute values of $a_{ij}$).
$\|x\|_\infty = \max_i |x_i|$  The infinity-norm of the vector x.
$\|A\|_\infty = \max_i \sum_j |a_{ij}|$  The infinity-norm of the matrix A.
$\|A\|_1 = \max_j \sum_i |a_{ij}|$  The one-norm of the matrix A. $\|A\|_1 = \|A^T\|_\infty = \|A^H\|_\infty$
$\kappa(A) = \|A\| \, \|A^{-1}\|$  The condition number of the matrix A.

Error Analysis

In practice, most computations are performed with rounding errors. Besides, you often need to solve a system Ax = b, where the data (the elements of A and b) are not known exactly.
Therefore, it is important to understand how the data errors and rounding errors can affect the solution x.

Data perturbations. If x is the exact solution of $Ax = b$, and $x + \delta x$ is the exact solution of a perturbed problem $(A + \delta A)x = (b + \delta b)$, then

$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A) \left( \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|} \right), \quad \text{where } \kappa(A) = \|A\| \, \|A^{-1}\|.$$

In other words, relative errors in A or b may be amplified in the solution vector x by a factor $\kappa(A) = \|A\| \, \|A^{-1}\|$ called the condition number of A.

Rounding errors have the same effect as relative perturbations $c(n)\varepsilon$ in the original data. Here $\varepsilon$ is the machine precision, and $c(n)$ is a modest function of the matrix order n. The corresponding solution error is $\|\delta x\| / \|x\| \le c(n)\,\kappa(A)\,\varepsilon$. (The value of $c(n)$ is seldom greater than 10n.)

Thus, if your matrix A is ill-conditioned (that is, its condition number $\kappa(A)$ is very large), then the error in the solution x is also large; you may even encounter a complete loss of precision. LAPACK provides routines that allow you to estimate $\kappa(A)$ (see Routines for Estimating the Condition Number) and also give you a more precise estimate for the actual solution error (see Refining the Solution and Estimating Its Error).

Computational Routines

Table "Computational Routines for Systems of Equations with Real Matrices" lists the LAPACK computational routines (FORTRAN 77 and Fortran 95 interfaces) for factorizing, equilibrating, and inverting real matrices, estimating their condition numbers, solving systems of equations with real matrices, refining the solution, and estimating its error. Table "Computational Routines for Systems of Equations with Complex Matrices" lists similar routines for complex matrices. Respective routine names in the Fortran 95 interface are without the first symbol (see Routine Naming Conventions).
Computational Routines for Systems of Equations with Real Matrices

Matrix type, storage scheme | Factorize matrix | Equilibrate matrix | Solve system | Condition number | Estimate error | Invert matrix
general | ?getrf | ?geequ, ?geequb | ?getrs | ?gecon | ?gerfs, ?gerfsx | ?getri
general band | ?gbtrf | ?gbequ, ?gbequb | ?gbtrs | ?gbcon | ?gbrfs, ?gbrfsx | -
general tridiagonal | ?gttrf | - | ?gttrs | ?gtcon | ?gtrfs | -
diagonally dominant tridiagonal | ?dttrfb | - | ?dttrsb | - | - | -
symmetric positive-definite | ?potrf | ?poequ, ?poequb | ?potrs | ?pocon | ?porfs, ?porfsx | ?potri
symmetric positive-definite, packed storage | ?pptrf | ?ppequ | ?pptrs | ?ppcon | ?pprfs | ?pptri
symmetric positive-definite, RFP storage | ?pftrf | - | ?pftrs | - | - | ?pftri
symmetric positive-definite, band | ?pbtrf | ?pbequ | ?pbtrs | ?pbcon | ?pbrfs | -
symmetric positive-definite, tridiagonal | ?pttrf | - | ?pttrs | ?ptcon | ?ptrfs | -
symmetric indefinite | ?sytrf | ?syequb | ?sytrs, ?sytrs2 | ?sycon, ?syconv | ?syrfs, ?syrfsx | ?sytri, ?sytri2, ?sytri2x
symmetric indefinite, packed storage | ?sptrf | - | ?sptrs | ?spcon | ?sprfs | ?sptri
triangular | - | - | ?trtrs | ?trcon | ?trrfs | ?trtri
triangular, packed storage | - | - | ?tptrs | ?tpcon | ?tprfs | ?tptri
triangular, RFP storage | - | - | - | - | - | ?tftri
triangular band | - | - | ?tbtrs | ?tbcon | ?tbrfs | -

In the table above, ? denotes s (single precision) or d (double precision) for the FORTRAN 77 interface.
Computational Routines for Systems of Equations with Complex Matrices

Matrix type, storage scheme | Factorize matrix | Equilibrate matrix | Solve system | Condition number | Estimate error | Invert matrix
general | ?getrf | ?geequ, ?geequb | ?getrs | ?gecon | ?gerfs, ?gerfsx | ?getri
general band | ?gbtrf | ?gbequ, ?gbequb | ?gbtrs | ?gbcon | ?gbrfs, ?gbrfsx | -
general tridiagonal | ?gttrf | - | ?gttrs | ?gtcon | ?gtrfs | -
Hermitian positive-definite | ?potrf | ?poequ, ?poequb | ?potrs | ?pocon | ?porfs, ?porfsx | ?potri
Hermitian positive-definite, packed storage | ?pptrf | ?ppequ | ?pptrs | ?ppcon | ?pprfs | ?pptri
Hermitian positive-definite, RFP storage | ?pftrf | - | ?pftrs | - | - | ?pftri
Hermitian positive-definite, band | ?pbtrf | ?pbequ | ?pbtrs | ?pbcon | ?pbrfs | -
Hermitian positive-definite, tridiagonal | ?pttrf | - | ?pttrs | ?ptcon | ?ptrfs | -
Hermitian indefinite | ?hetrf | ?heequb | ?hetrs, ?hetrs2 | ?hecon | ?herfs, ?herfsx | ?hetri, ?hetri2, ?hetri2x
symmetric indefinite | ?sytrf | ?syequb | ?sytrs, ?sytrs2 | ?sycon, ?syconv | ?syrfs, ?syrfsx | ?sytri, ?sytri2, ?sytri2x
Hermitian indefinite, packed storage | ?hptrf | - | ?hptrs | ?hpcon | ?hprfs | ?hptri
symmetric indefinite, packed storage | ?sptrf | - | ?sptrs | ?spcon | ?sprfs | ?sptri
triangular | - | - | ?trtrs | ?trcon | ?trrfs | ?trtri
triangular, packed storage | - | - | ?tptrs | ?tpcon | ?tprfs | ?tptri
triangular, RFP storage | - | - | - | - | - | ?tftri
triangular band | - | - | ?tbtrs | ?tbcon | ?tbrfs | -

In the table above, ? stands for c (single precision complex) or z (double precision complex) for the FORTRAN 77 interface.

Routines for Matrix Factorization

This section describes the LAPACK routines for matrix factorization.
The following factorizations are supported:
• LU factorization
• Cholesky factorization of real symmetric positive-definite matrices
• Cholesky factorization of real symmetric positive-definite matrices with pivoting
• Cholesky factorization of Hermitian positive-definite matrices
• Cholesky factorization of Hermitian positive-definite matrices with pivoting
• Bunch-Kaufman factorization of real and complex symmetric matrices
• Bunch-Kaufman factorization of Hermitian matrices.

You can compute:
• the LU factorization using full and band storage of matrices
• the Cholesky factorization using full, packed, RFP, and band storage
• the Bunch-Kaufman factorization using full and packed storage.

?getrf

Computes the LU factorization of a general m-by-n matrix.

Syntax

Fortran 77:
call sgetrf( m, n, a, lda, ipiv, info )
call dgetrf( m, n, a, lda, ipiv, info )
call cgetrf( m, n, a, lda, ipiv, info )
call zgetrf( m, n, a, lda, ipiv, info )

Fortran 95:
call getrf( a [,ipiv] [,info] )

C:
lapack_int LAPACKE_<?>getrf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the LU factorization of a general m-by-n matrix A as A = P*L*U, where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n) and U is upper triangular (upper trapezoidal if m < n). The routine uses partial pivoting, with row interchanges.

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A; n ≥ 0.

a  REAL for sgetrf. DOUBLE PRECISION for dgetrf. COMPLEX for cgetrf. DOUBLE COMPLEX for zgetrf. Array, DIMENSION (lda,*). Contains the matrix A. The second dimension of a must be at least max(1, n).

lda  INTEGER. The leading dimension of array a.

Output Parameters

a  Overwritten by L and U. The unit diagonal elements of L are not stored.

ipiv  INTEGER. Array, DIMENSION at least max(1,min(m, n)). The pivot indices; for 1 ≤ i ≤ min(m, n), row i was interchanged with row ipiv(i).

info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, $u_{ii}$ is 0. The factorization has been completed, but U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine getrf interface are as follows:
a  Holds the matrix A of size (m,n).
ipiv  Holds the vector of length min(m,n).

Application Notes

The computed L and U are the exact factors of a perturbed matrix A + E, where

$|E| \le c(\min(m,n))\,\varepsilon\,P\,|L|\,|U|$

$c(n)$ is a modest linear function of n, and $\varepsilon$ is the machine precision.

The approximate number of floating-point operations for real flavors is
$(2/3)n^3$ if m = n,
$(1/3)n^2(3m-n)$ if m > n,
$(1/3)m^2(3n-m)$ if m < n.
The number of operations for complex flavors is four times greater.

After calling this routine with m = n, you can call the following:
?getrs  to solve $AX = B$ or $A^T X = B$ or $A^H X = B$
?gecon  to estimate the condition number of A
?getri  to compute the inverse of A.

See Also
mkl_progress

?gbtrf

Computes the LU factorization of a general m-by-n band matrix.
Syntax
Fortran 77:
call sgbtrf( m, n, kl, ku, ab, ldab, ipiv, info )
call dgbtrf( m, n, kl, ku, ab, ldab, ipiv, info )
call cgbtrf( m, n, kl, ku, ab, ldab, ipiv, info )
call zgbtrf( m, n, kl, ku, ab, ldab, ipiv, info )
Fortran 95:
call gbtrf( ab [,kl] [,m] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>gbtrf( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, <datatype>* ab, lapack_int ldab, lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the LU factorization of a general m-by-n band matrix A with kl non-zero subdiagonals and ku non-zero superdiagonals, that is,
A = P*L*U,
where P is a permutation matrix; L is lower triangular with unit diagonal elements and at most kl non-zero elements in each column; U is an upper triangular band matrix with kl + ku superdiagonals. The routine uses partial pivoting, with row interchanges (which creates the additional kl superdiagonals in U).

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows in matrix A; m ≥ 0.
n INTEGER. The number of columns in matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab REAL for sgbtrf
DOUBLE PRECISION for dgbtrf
COMPLEX for cgbtrf
DOUBLE COMPLEX for zgbtrf.
Array, DIMENSION (ldab,*). The array ab contains the matrix A in band storage, in rows kl + 1 to 2*kl + ku + 1; rows 1 to kl of the array need not be set.
The j-th column of A is stored in the j-th column of the array ab as follows:
ab(kl + ku + 1 + i - j, j) = a(i,j) for max(1,j-ku) ≤ i ≤ min(m,j+kl).
ldab INTEGER. The leading dimension of the array ab. (ldab ≥ 2*kl + ku + 1)

Output Parameters
ab Overwritten by L and U. U is stored as an upper triangular band matrix with kl + ku superdiagonals in rows 1 to kl + ku + 1, and the multipliers used during the factorization are stored in rows kl + ku + 2 to 2*kl + ku + 1. See Application Notes below for further details.
ipiv INTEGER. Array, DIMENSION at least max(1,min(m, n)). The pivot indices; for 1 ≤ i ≤ min(m, n), row i was interchanged with row ipiv(i).
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, u_ii is 0. The factorization has been completed, but U is exactly singular. Division by 0 will occur if you use the factor U for solving a system of linear equations.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbtrf interface are as follows:
ab Holds the array A of size (2*kl+ku+1,n).
ipiv Holds the vector of length min(m,n).
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-2*kl-1.
m If omitted, assumed m = n.

Application Notes
The computed L and U are the exact factors of a perturbed matrix A + E, where
|E| ≤ c(kl+ku+1) ε P|L||U|,
c(k) is a modest linear function of k, and ε is the machine precision.
The total number of floating-point operations for real flavors varies between approximately 2n(ku+1)kl and 2n(kl+ku+1)kl. The number of operations for complex flavors is four times greater. All these estimates assume that kl and ku are much less than min(m,n).
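The mapping ab(kl + ku + 1 + i - j, j) = a(i,j) given for the ab argument can be sketched in plain C. The helper below is only an illustration under stated assumptions: the name pack_band, the CM indexing macro, and the choice of double data are hypothetical and are not part of Intel MKL; the arrays are assumed to be column-major, as in the Fortran interface.

```c
#include <assert.h>
#include <stddef.h>

/* Element (row, col), 1-based, of a column-major array with leading
   dimension ld. Hypothetical helper, not an MKL macro. */
#define CM(arr, ld, row, col) \
    ((arr)[((size_t)(col) - 1) * (size_t)(ld) + ((size_t)(row) - 1)])

/* Copy the band of a dense m-by-n matrix a (leading dimension lda) into
   ab (leading dimension ldab), following
     ab(kl + ku + 1 + i - j, j) = a(i, j)  for max(1, j-ku) <= i <= min(m, j+kl).
   Rows 1..kl of ab are left untouched: they need not be set on entry and
   receive fill-in from U during the factorization.
   Returns 0 on success, -1 if ldab < 2*kl + ku + 1. */
int pack_band(int m, int n, int kl, int ku,
              const double *a, int lda, double *ab, int ldab)
{
    assert(a != NULL && ab != NULL);
    if (ldab < 2 * kl + ku + 1)
        return -1;
    for (int j = 1; j <= n; ++j) {
        int ilo = (j - ku > 1) ? j - ku : 1;   /* max(1, j-ku) */
        int ihi = (j + kl < m) ? j + kl : m;   /* min(m, j+kl) */
        for (int i = ilo; i <= ihi; ++i)
            CM(ab, ldab, kl + ku + 1 + i - j, j) = CM(a, lda, i, j);
    }
    return 0;
}
```

With m = n = 6, kl = 2, ku = 1 and ldab = 6, the diagonal element a(1,1) lands in ab(4,1), the superdiagonal element a(1,2) in ab(3,2), and the subdiagonal element a(3,1) in ab(6,1). An array packed this way has the layout that the ab argument of ?gbtrf expects.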
The band storage scheme is illustrated by the following example, when m = n = 6, kl = 2, ku = 1:

on entry:                        on exit:
 *    *    *    +    +    +       *    *    *   u14  u25  u36
 *    *    +    +    +    +       *    *   u13  u24  u35  u46
 *   a12  a23  a34  a45  a56      *   u12  u23  u34  u45  u56
a11  a22  a33  a44  a55  a66     u11  u22  u33  u44  u55  u66
a21  a32  a43  a54  a65   *      m21  m32  m43  m54  m65   *
a31  a42  a53  a64   *    *      m31  m42  m53  m64   *    *

Array elements marked * are not used by the routine; elements marked + need not be set on entry, but are required by the routine to store elements of U because of fill-in resulting from the row interchanges.
After calling this routine with m = n, you can call the following routines:
?gbtrs to solve A*X = B or A^T*X = B or A^H*X = B
?gbcon to estimate the condition number of A.

See Also
mkl_progress

?gttrf
Computes the LU factorization of a tridiagonal matrix.

Syntax
Fortran 77:
call sgttrf( n, dl, d, du, du2, ipiv, info )
call dgttrf( n, dl, d, du, du2, ipiv, info )
call cgttrf( n, dl, d, du, du2, ipiv, info )
call zgttrf( n, dl, d, du, du2, ipiv, info )
Fortran 95:
call gttrf( dl, d, du, du2 [, ipiv] [,info] )
C:
lapack_int LAPACKE_<?>gttrf( lapack_int n, <datatype>* dl, <datatype>* d, <datatype>* du, <datatype>* du2, lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the LU factorization of a real or complex tridiagonal matrix A in the form
A = P*L*U,
where P is a permutation matrix; L is lower bidiagonal with unit diagonal elements; and U is an upper triangular matrix with nonzeroes in only the main diagonal and first two superdiagonals. The routine uses elimination with partial pivoting and row interchanges.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
dl, d, du REAL for sgttrf
DOUBLE PRECISION for dgttrf
COMPLEX for cgttrf
DOUBLE COMPLEX for zgttrf.
Arrays containing elements of A.
The array dl of dimension (n - 1) contains the subdiagonal elements of A.
The array d of dimension n contains the diagonal elements of A.
The array du of dimension (n - 1) contains the superdiagonal elements of A.

Output Parameters
dl Overwritten by the (n-1) multipliers that define the matrix L from the LU factorization of A. The matrix L has unit diagonal elements, and the (n-1) elements of dl form the subdiagonal. All other elements of L are zero.
d Overwritten by the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
du Overwritten by the (n-1) elements of the first superdiagonal of U.
du2 REAL for sgttrf
DOUBLE PRECISION for dgttrf
COMPLEX for cgttrf
DOUBLE COMPLEX for zgttrf.
Array, dimension (n-2). On exit, du2 contains (n-2) elements of the second superdiagonal of U.
ipiv INTEGER. Array, dimension (n). The pivot indices: for 1 ≤ i ≤ n, row i was interchanged with row ipiv(i). ipiv(i) is always i or i+1; ipiv(i) = i indicates a row interchange was not required.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, u_ii is 0. The factorization has been completed, but U is exactly singular. Division by zero will occur if you use the factor U for solving a system of linear equations.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gttrf interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
du2 Holds the vector of length (n-2).
ipiv Holds the vector of length n.

Application Notes
After calling this routine, you can call the following routines:
?gttrs to solve A*X = B or A^T*X = B or A^H*X = B
?gtcon to estimate the condition number of A.

?dttrfb
Computes the factorization of a diagonally dominant tridiagonal matrix.
Syntax
Fortran 77:
call sdttrfb( n, dl, d, du, info )
call ddttrfb( n, dl, d, du, info )
call cdttrfb( n, dl, d, du, info )
call zdttrfb( n, dl, d, du, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?dttrfb routine computes the factorization of a real or complex diagonally dominant tridiagonal matrix A with the BABE (Burning At Both Ends) algorithm in the form
A = L1*U*L2
where
• L1, L2 are lower bidiagonal with unit diagonal elements corresponding to the Gaussian elimination taken from both ends of the matrix.
• U is an upper triangular matrix with nonzeroes in only the main diagonal and first two superdiagonals.

Input Parameters
n INTEGER. The order of the matrix A; n ≥ 0.
dl, d, du REAL for sdttrfb
DOUBLE PRECISION for ddttrfb
COMPLEX for cdttrfb
DOUBLE COMPLEX for zdttrfb.
Arrays containing elements of A.
The array dl of dimension (n - 1) contains the subdiagonal elements of A.
The array d of dimension n contains the diagonal elements of A.
The array du of dimension (n - 1) contains the superdiagonal elements of A.

Output Parameters
dl Overwritten by the (n-1) multipliers that define the matrix L from the LU factorization of A.
d Overwritten by the n diagonal element reciprocals of the upper triangular matrix U from the factorization of A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, u_ii is 0. The factorization has been completed, but U is exactly singular. Division by zero will occur if you use the factor U for solving a system of linear equations.

Application Notes
A diagonally dominant tridiagonal system is defined such that
|d_i| > |dl_(i-1)| + |du_i| for any i: 1 < i < n,
and |d_1| > |du_1|, |d_n| > |dl_(n-1)|.
The underlying BABE algorithm is designed for diagonally dominant systems.
Such systems do not have the numerical stability issue that forces general tridiagonal systems to use elimination with partial pivoting (see ?gttrf). Because no pivoting is needed, diagonally dominant systems are factored much faster than general ones.

NOTE
• The current implementation of BABE has a potential accuracy issue on very small or large data close to the underflow or overflow threshold respectively. Scale the matrix before applying the solver in the case of such input data.
• Applying the ?dttrfb factorization to non-diagonally dominant systems may lead to a loss of accuracy, or to a falsely detected singularity, because no pivoting is performed.

?potrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call spotrf( uplo, n, a, lda, info )
call dpotrf( uplo, n, a, lda, info )
call cpotrf( uplo, n, a, lda, info )
call zpotrf( uplo, n, a, lda, info )
Fortran 95:
call potrf( a [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>potrf( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the Cholesky factorization of a symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular.

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of matrix A; n ≥ 0.
a REAL for spotrf
DOUBLE PRECISION for dpotrf
COMPLEX for cpotrf
DOUBLE COMPLEX for zpotrf.
Array, DIMENSION (lda,*). The array a contains either the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a.

Output Parameters
a The upper or lower triangular part of a is overwritten by the Cholesky factor U or L, as specified by uplo.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the factorization could not be completed. This may indicate an error in forming the matrix A.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine potrf interface are as follows:
a Holds the matrix A of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If uplo = 'U', the computed factor U is the exact factor of a perturbed matrix A + E, where
|E| ≤ c(n) ε |U^H||U|, |e_ij| ≤ c(n) ε sqrt(a_ii*a_jj),
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for uplo = 'L'.
The total number of floating-point operations is approximately (1/3)n^3 for real flavors or (4/3)n^3 for complex flavors.
After calling this routine, you can call the following routines:
?potrs to solve A*X = B
?pocon to estimate the condition number of A
?potri to compute the inverse of A.
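The meaning of a positive info value can be illustrated without calling the library. The sketch below is not the MKL algorithm: it is a naive, unblocked, square-root-free elimination (an L*D*L^T variant of Cholesky) on the lower triangle of a real column-major matrix, and it returns the order i of the first leading minor that is not positive-definite, or 0 if every pivot is positive. The function name first_non_pd_minor is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Square-root-free elimination on the lower triangle of a real symmetric
   n-by-n matrix stored column-major with leading dimension lda.
   On return the diagonal holds the pivots d_j and the strictly lower
   triangle holds the unit-lower factor of A = L*D*L^T (up to the failing
   column, if any). The strictly upper triangle is never referenced.
   Return value follows the info convention described above:
     0      all pivots positive, A is positive-definite;
     i > 0  the leading minor of order i is not positive-definite.
   Illustration only; not how MKL implements ?potrf. */
int first_non_pd_minor(int n, double *a, int lda)
{
    assert(a != NULL && n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        /* d_j = a_jj - sum_{k<j} l_jk^2 * d_k */
        double d = a[(size_t)j * lda + j];
        for (int k = 0; k < j; ++k) {
            double ljk = a[(size_t)k * lda + j];
            d -= ljk * ljk * a[(size_t)k * lda + k];
        }
        if (!(d > 0.0))
            return j + 1;               /* 1-based, like info = i */
        a[(size_t)j * lda + j] = d;
        /* l_ij = (a_ij - sum_{k<j} l_ik * l_jk * d_k) / d_j for i > j */
        for (int i = j + 1; i < n; ++i) {
            double s = a[(size_t)j * lda + i];
            for (int k = 0; k < j; ++k)
                s -= a[(size_t)k * lda + i] * a[(size_t)k * lda + j]
                     * a[(size_t)k * lda + k];
            a[(size_t)j * lda + i] = s / d;
        }
    }
    return 0;
}
```

For the matrix with rows (4, 2) and (2, 3) the function returns 0 with pivots 4 and 2; for rows (1, 2) and (2, 1) it returns 2, because the second pivot is 1 - 4 = -3. In exact arithmetic a Cholesky factorization breaks down at the same column, which is the situation reported as info = i.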
See Also
mkl_progress

?pstrf
Computes the Cholesky factorization with complete pivoting of a real symmetric (complex Hermitian) positive semidefinite matrix.

Syntax
Fortran 77:
call spstrf( uplo, n, a, lda, piv, rank, tol, work, info )
call dpstrf( uplo, n, a, lda, piv, rank, tol, work, info )
call cpstrf( uplo, n, a, lda, piv, rank, tol, work, info )
call zpstrf( uplo, n, a, lda, piv, rank, tol, work, info )
C:
lapack_int LAPACKE_spstrf( int matrix_order, char uplo, lapack_int n, float* a, lapack_int lda, lapack_int* piv, lapack_int* rank, float tol );
lapack_int LAPACKE_dpstrf( int matrix_order, char uplo, lapack_int n, double* a, lapack_int lda, lapack_int* piv, lapack_int* rank, double tol );
lapack_int LAPACKE_cpstrf( int matrix_order, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, lapack_int* piv, lapack_int* rank, float tol );
lapack_int LAPACKE_zpstrf( int matrix_order, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, lapack_int* piv, lapack_int* rank, double tol );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the Cholesky factorization with complete pivoting of a real symmetric (complex Hermitian) positive semidefinite matrix. The form of the factorization is:
P^T * A * P = U^T * U, if uplo ='U' for real flavors,
P^H * A * P = U^H * U, if uplo ='U' for complex flavors,
P^T * A * P = L * L^T, if uplo ='L' for real flavors,
P^H * A * P = L * L^H, if uplo ='L' for complex flavors,
where P is stored as vector piv, and U and L are upper and lower triangular matrices respectively.
This algorithm does not attempt to check that A is positive semidefinite. This version of the algorithm calls level 3 BLAS.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A, and the strictly lower triangular part of the matrix is not referenced.
If uplo = 'L', the array a stores the lower triangular part of the matrix A, and the strictly upper triangular part of the matrix is not referenced.
n INTEGER. The order of matrix A; n ≥ 0.
a, work REAL for spstrf
DOUBLE PRECISION for dpstrf
COMPLEX for cpstrf
DOUBLE COMPLEX for zpstrf.
Array a, DIMENSION (lda,*). The array a contains either the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n).
work(*) is a workspace array. The dimension of work is at least max(1,2*n).
tol REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
User-defined tolerance. If tol < 0, then n*U*max(a(k,k)) will be used. The algorithm terminates at the (k-1)-th step, if the pivot ≤ tol.
lda INTEGER. The leading dimension of a; at least max(1, n).

Output Parameters
a If info = 0, the factor U or L from the Cholesky factorization is as described in Description.
piv INTEGER. Array, DIMENSION at least max(1, n). The array piv is such that the nonzero entries are p( piv(k),k ) = 1.
rank INTEGER. The rank of a given by the number of steps the algorithm completed.
info INTEGER. If info = 0, the execution is successful.
If info = -k, the k-th argument had an illegal value.
If info > 0, the matrix A is either rank deficient with a computed rank as returned in rank, or is indefinite.

?pftrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite matrix using the Rectangular Full Packed (RFP) format.
Syntax
Fortran 77:
call spftrf( transr, uplo, n, a, info )
call dpftrf( transr, uplo, n, a, info )
call cpftrf( transr, uplo, n, a, info )
call zpftrf( transr, uplo, n, a, info )
C:
lapack_int LAPACKE_<?>pftrf( int matrix_order, char transr, char uplo, lapack_int n, <datatype>* a );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the Cholesky factorization of a symmetric positive-definite or, for complex data, a Hermitian positive-definite matrix A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular.
The matrix A is in the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.
This is the block version of the algorithm, calling Level 3 BLAS.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1. Must be 'N', 'T' (for real data) or 'C' (for complex data).
If transr = 'N', the Normal form of RFP A is stored.
If transr = 'T', the Transpose form of RFP A is stored.
If transr = 'C', the Conjugate-Transpose form of RFP A is stored.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for spftrf
DOUBLE PRECISION for dpftrf
COMPLEX for cpftrf
DOUBLE COMPLEX for zpftrf.
Array, DIMENSION (n*(n+1)/2). The array a contains the matrix A in the RFP format.

Output Parameters
a The upper or lower triangular part of a is overwritten by the Cholesky factor U or L, as specified by uplo.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the factorization could not be completed. This may indicate an error in forming the matrix A.

?pptrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite matrix using packed storage.

Syntax
Fortran 77:
call spptrf( uplo, n, ap, info )
call dpptrf( uplo, n, ap, info )
call cpptrf( uplo, n, ap, info )
call zpptrf( uplo, n, ap, info )
Fortran 95:
call pptrf( ap [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>pptrf( int matrix_order, char uplo, lapack_int n, <datatype>* ap );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the Cholesky factorization of a symmetric positive-definite or, for complex data, Hermitian positive-definite packed matrix A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular.

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is packed in the array ap, and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the matrix A, and A is factored as U^H*U.
If uplo = 'L', the array ap stores the lower triangular part of the matrix A; A is factored as L*L^H.
n INTEGER. The order of matrix A; n ≥ 0.
ap REAL for spptrf
DOUBLE PRECISION for dpptrf
COMPLEX for cpptrf
DOUBLE COMPLEX for zpptrf.
Array, DIMENSION at least max(1, n(n+1)/2). The array ap contains either the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes).

Output Parameters
ap The upper or lower triangular part of A in packed storage is overwritten by the Cholesky factor U or L, as specified by uplo.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the factorization could not be completed. This may indicate an error in forming the matrix A.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pptrf interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If uplo = 'U', the computed factor U is the exact factor of a perturbed matrix A + E, where
|E| ≤ c(n) ε |U^H||U|, |e_ij| ≤ c(n) ε sqrt(a_ii*a_jj),
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for uplo = 'L'.
The total number of floating-point operations is approximately (1/3)n^3 for real flavors and (4/3)n^3 for complex flavors.
After calling this routine, you can call the following routines:
?pptrs to solve A*X = B
?ppcon to estimate the condition number of A
?pptri to compute the inverse of A.

See Also
mkl_progress

?pbtrf
Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite band matrix.
Syntax
Fortran 77:
call spbtrf( uplo, n, kd, ab, ldab, info )
call dpbtrf( uplo, n, kd, ab, ldab, info )
call cpbtrf( uplo, n, kd, ab, ldab, info )
call zpbtrf( uplo, n, kd, ab, ldab, info )
Fortran 95:
call pbtrf( ab [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>pbtrf( int matrix_order, char uplo, lapack_int n, lapack_int kd, <datatype>* ab, lapack_int ldab );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the Cholesky factorization of a symmetric positive-definite or, for complex data, Hermitian positive-definite band matrix A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular.

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored in the array ab, and how A is factored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
ab REAL for spbtrf
DOUBLE PRECISION for dpbtrf
COMPLEX for cpbtrf
DOUBLE COMPLEX for zpbtrf.
Array, DIMENSION (ldab,*). The array ab contains either the upper or the lower triangular part of the matrix A (as specified by uplo) in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n).
ldab INTEGER. The leading dimension of the array ab.
(ldab ≥ kd + 1)

Output Parameters
ab The upper or lower triangular part of A (in band storage) is overwritten by the Cholesky factor U or L, as specified by uplo.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the factorization could not be completed. This may indicate an error in forming the matrix A.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbtrf interface are as follows:
ab Holds the array A of size (kd+1,n).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If uplo = 'U', the computed factor U is the exact factor of a perturbed matrix A + E, where
|E| ≤ c(kd + 1) ε |U^H||U|, |e_ij| ≤ c(kd + 1) ε sqrt(a_ii*a_jj),
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for uplo = 'L'.
The total number of floating-point operations for real flavors is approximately n(kd+1)^2. The number of operations for complex flavors is 4 times greater. All these estimates assume that kd is much less than n.
After calling this routine, you can call the following routines:
?pbtrs to solve A*X = B
?pbcon to estimate the condition number of A.

See Also
mkl_progress

?pttrf
Computes the factorization of a symmetric (Hermitian) positive-definite tridiagonal matrix.
Syntax
Fortran 77:
call spttrf( n, d, e, info )
call dpttrf( n, d, e, info )
call cpttrf( n, d, e, info )
call zpttrf( n, d, e, info )
Fortran 95:
call pttrf( d, e [,info] )
C:
lapack_int LAPACKE_spttrf( lapack_int n, float* d, float* e );
lapack_int LAPACKE_dpttrf( lapack_int n, double* d, double* e );
lapack_int LAPACKE_cpttrf( lapack_int n, float* d, lapack_complex_float* e );
lapack_int LAPACKE_zpttrf( lapack_int n, double* d, lapack_complex_double* e );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the factorization of a symmetric positive-definite or, for complex data, Hermitian positive-definite tridiagonal matrix A:
A = L*D*L^T for real flavors, or
A = L*D*L^H for complex flavors,
where D is diagonal and L is unit lower bidiagonal. The factorization may also be regarded as having the form A = U^T*D*U for real flavors, or A = U^H*D*U for complex flavors, where D is diagonal and U is unit upper bidiagonal.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
d REAL for spttrf, cpttrf
DOUBLE PRECISION for dpttrf, zpttrf.
Array, dimension (n). Contains the diagonal elements of A.
e REAL for spttrf
DOUBLE PRECISION for dpttrf
COMPLEX for cpttrf
DOUBLE COMPLEX for zpttrf.
Array, dimension (n-1). Contains the subdiagonal elements of A.

Output Parameters
d Overwritten by the n diagonal elements of the diagonal matrix D from the L*D*L^T (for real flavors) or L*D*L^H (for complex flavors) factorization of A.
e Overwritten by the (n - 1) off-diagonal elements of the unit bidiagonal factor L or U from the factorization of A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite; if i < n, the factorization could not be completed, while if i = n, the factorization was completed, but d(n) ≤ 0.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pttrf interface are as follows:
d Holds the vector of length n.
e Holds the vector of length (n-1).

?sytrf
Computes the Bunch-Kaufman factorization of a symmetric matrix.

Syntax
Fortran 77:
call ssytrf( uplo, n, a, lda, ipiv, work, lwork, info )
call dsytrf( uplo, n, a, lda, ipiv, work, lwork, info )
call csytrf( uplo, n, a, lda, ipiv, work, lwork, info )
call zsytrf( uplo, n, a, lda, ipiv, work, lwork, info )
Fortran 95:
call sytrf( a [, uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>sytrf( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the factorization of a real/complex symmetric matrix A using the Bunch-Kaufman diagonal pivoting method. The form of the factorization is:
if uplo='U', A = P*U*D*U^T*P^T
if uplo='L', A = P*L*D*L^T*P^T,
where A is the input matrix, P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a symmetric block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.

NOTE This routine supports the Progress Routine feature. See Progress Routine section for details.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A, and A is factored as P*U*D*U^T*P^T.
If uplo = 'L', the array a stores the lower triangular part of the matrix A, and A is factored as P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
a REAL for ssytrf
DOUBLE PRECISION for dsytrf
COMPLEX for csytrf
DOUBLE COMPLEX for zsytrf.
Array, DIMENSION (lda,*). The array a contains either the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a; at least max(1, n).
work Same type as a. A workspace array, dimension at least max(1,lwork).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
a The upper or lower triangular part of a is overwritten by details of the block-diagonal matrix D and the multipliers used to obtain the factor U (or L).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, d_ii is 0. The factorization has been completed, but D is exactly singular. Division by 0 will occur if you use D for solving a system of linear equations.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sytrf interface are as follows:
a  Holds the matrix A of size (n, n).
ipiv  Holds the vector of length n.
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work).
This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The 2-by-2 unit diagonal blocks and the unit diagonal elements of U and L are not stored. The remaining elements of U and L are stored in the corresponding columns of the array a, but additional row interchanges are required to recover U or L explicitly (which is seldom necessary).
If ipiv(i) = i for all i = 1...n, then all off-diagonal elements of U (L) are stored explicitly in the corresponding elements of the array a.
If uplo = 'U', the computed factors U and D are the exact factors of a perturbed matrix A + E, where
|E| ≤ c(n)ε P|U||D||U^T|P^T,
c(n) is a modest linear function of n, and ε is the machine precision. A similar estimate holds for the computed L and D when uplo = 'L'.
The total number of floating-point operations is approximately (1/3)n^3 for real flavors or (4/3)n^3 for complex flavors.
After calling this routine, you can call the following routines:
?sytrs  to solve A*X = B
?sycon  to estimate the condition number of A
?sytri  to compute the inverse of A.
If uplo = 'U', then A = U*D*U', where U = P(n)*U(n)* ... *P(k)*U(k)*..., that is, U is a product of terms P(k)*U(k), where
• k decreases from n to 1 in steps of 1 and 2.
• D is a block diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks D(k).
• P(k) is a permutation matrix as defined by ipiv(k).
• U(k) is a unit upper triangular matrix, such that if the diagonal block D(k) is of order s (s = 1 or 2), then

           (  I   v   0  )  k-s
   U(k) =  (  0   I   0  )  s
           (  0   0   I  )  n-k
             k-s  s  n-k

If s = 1, D(k) overwrites A(k,k), and v overwrites A(1:k-1,k).
If s = 2, the upper triangle of D(k) overwrites A(k-1,k-1), A(k-1,k) and A(k,k), and v overwrites A(1:k-2,k-1:k).
If uplo = 'L', then A = L*D*L', where L = P(1)*L(1)* ...
*P(k)*L(k)*..., that is, L is a product of terms P(k)*L(k), where
• k increases from 1 to n in steps of 1 and 2.
• D is a block diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks D(k).
• P(k) is a permutation matrix as defined by ipiv(k).
• L(k) is a unit lower triangular matrix, such that if the diagonal block D(k) is of order s (s = 1 or 2), then

           (  I   0   0  )  k-1
   L(k) =  (  0   I   0  )  s
           (  0   v   I  )  n-k-s+1
             k-1  s  n-k-s+1

If s = 1, D(k) overwrites A(k,k), and v overwrites A(k+1:n,k).
If s = 2, the lower triangle of D(k) overwrites A(k,k), A(k+1,k), and A(k+1,k+1), and v overwrites A(k+2:n,k:k+1).

See Also
mkl_progress

?hetrf
Computes the Bunch-Kaufman factorization of a complex Hermitian matrix.

Syntax
Fortran 77:
call chetrf( uplo, n, a, lda, ipiv, work, lwork, info )
call zhetrf( uplo, n, a, lda, ipiv, work, lwork, info )
Fortran 95:
call hetrf( a [, uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_hetrf( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the factorization of a complex Hermitian matrix A using the Bunch-Kaufman diagonal pivoting method:
if uplo='U', A = P*U*D*U^H*P^T
if uplo='L', A = P*L*D*L^H*P^T,
where A is the input matrix, P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.
NOTE This routine supports the Progress Routine feature. See Progress Routine section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A, and A is factored as P*U*D*U^H*P^T.
If uplo = 'L', the array a stores the lower triangular part of the matrix A, and A is factored as P*L*D*L^H*P^T.
n  INTEGER. The order of matrix A; n ≥ 0.
a, work  COMPLEX for chetrf
DOUBLE COMPLEX for zhetrf.
Arrays, DIMENSION a(lda,*), work(*). The array a contains the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n). work(*) is a workspace array of dimension at least max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, n).
lwork  INTEGER. The size of the work array (lwork ≥ n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a  The upper or lower triangular part of a is overwritten by details of the block-diagonal matrix D and the multipliers used to obtain the factor U (or L).
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
ipiv  INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, d_ii is 0. The factorization has been completed, but D is exactly singular. Division by 0 will occur if you use D for solving a system of linear equations.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetrf interface are as follows:
a  Holds the matrix A of size (n, n).
ipiv  Holds the vector of length n.
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
This routine is suitable for Hermitian matrices that are not known to be positive-definite. If A is in fact positive-definite, the routine does not perform interchanges, and no 2-by-2 diagonal blocks occur in D.
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set lwork to any admissible size (no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
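The two-call workspace query can be sketched in C as shown here. This is only an illustration of the calling pattern: factor_stub is a hypothetical stand-in, not an MKL routine, and it imitates nothing but the lwork protocol described in this section (lwork = -1 returns the recommended size in the first element of work; lwork below the minimum n is an error exit). The block size of 32 and the -1000 allocation-failure code are likewise assumptions made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed block size, for illustration only. */
enum { STUB_BLOCKSIZE = 32 };

/* Hypothetical stand-in that mimics only the lwork protocol described
   in the text. It is NOT an MKL routine and performs no factorization. */
int factor_stub(int n, double *work, int lwork)
{
    assert(n >= 0 && work != NULL);
    if (lwork == -1) {                       /* workspace query */
        work[0] = (double)(n > 0 ? n * STUB_BLOCKSIZE : 1);
        return 0;
    }
    if (lwork < (n > 1 ? n : 1))             /* below the minimum: error exit */
        return -7;                           /* lwork is the 7th argument */
    /* ... a real routine would factor the matrix here ... */
    work[0] = (double)(n > 0 ? n * STUB_BLOCKSIZE : 1);
    return 0;
}

/* Two-call pattern: query the size, allocate, then call again.
   Returns the info value of the second call, or -1000 if the
   allocation fails. Stores the workspace size used in *lwork_used. */
int factor_with_query(int n, int *lwork_used)
{
    double query = 0.0;
    int info = factor_stub(n, &query, -1);   /* first call: query only */
    if (info != 0)
        return info;

    int lwork = (int)query;
    double *work = malloc((size_t)lwork * sizeof *work);
    if (work == NULL)
        return -1000;

    info = factor_stub(n, work, lwork);      /* second call: real work */
    *lwork_used = lwork;
    free(work);
    return info;
}
```

With a real routine, the first call would pass lwork = -1 together with the actual matrix arguments, and the size read back from work(1) would be used to allocate the workspace for the second call.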
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The 2-by-2 unit diagonal blocks and the unit diagonal elements of U and L are not stored. The remaining elements of U and L are stored in the corresponding columns of the array a, but additional row interchanges are required to recover U or L explicitly (which is seldom necessary).
If ipiv(i) = i for all i = 1...n, then all off-diagonal elements of U (L) are stored explicitly in the corresponding elements of the array a.
If uplo = 'U', the computed factors U and D are the exact factors of a perturbed matrix A + E, where
|E| ≤ c(n)ε P|U||D||U^T|P^T,
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for the computed L and D when uplo = 'L'.
The total number of floating-point operations is approximately (4/3)n^3.
After calling this routine, you can call the following routines:
?hetrs  to solve A*X = B
?hecon  to estimate the condition number of A
?hetri  to compute the inverse of A.

See Also
mkl_progress

?sptrf
Computes the Bunch-Kaufman factorization of a symmetric matrix using packed storage.

Syntax
Fortran 77:
call ssptrf( uplo, n, ap, ipiv, info )
call dsptrf( uplo, n, ap, ipiv, info )
call csptrf( uplo, n, ap, ipiv, info )
call zsptrf( uplo, n, ap, ipiv, info )
Fortran 95:
call sptrf( ap [,uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_sptrf( int matrix_order, char uplo, lapack_int n, <datatype>* ap, lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the factorization of a real/complex symmetric matrix A stored in the packed format using the Bunch-Kaufman diagonal pivoting method.
The form of the factorization is:
if uplo='U', A = P*U*D*U^T*P^T
if uplo='L', A = P*L*D*L^T*P^T,
where P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a symmetric block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is packed in the array ap and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the matrix A, and A is factored as P*U*D*U^T*P^T.
If uplo = 'L', the array ap stores the lower triangular part of the matrix A, and A is factored as P*L*D*L^T*P^T.
n  INTEGER. The order of matrix A; n ≥ 0.
ap  REAL for ssptrf
DOUBLE PRECISION for dsptrf
COMPLEX for csptrf
DOUBLE COMPLEX for zsptrf.
Array, DIMENSION at least max(1, n(n+1)/2). The array ap contains the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes).

Output Parameters
ap  The upper or lower triangle of A (as specified by uplo) is overwritten by details of the block-diagonal matrix D and the multipliers used to obtain the factor U (or L).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, d_ii is 0. The factorization has been completed, but D is exactly singular. Division by 0 will occur if you use D for solving a system of linear equations.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sptrf interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
ipiv  Holds the vector of length n.
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The 2-by-2 unit diagonal blocks and the unit diagonal elements of U and L are not stored. The remaining elements of U and L overwrite elements of the corresponding columns of the matrix A, but additional row interchanges are required to recover U or L explicitly (which is seldom necessary).
If ipiv(i) = i for all i = 1...n, then all off-diagonal elements of U (L) are stored explicitly in packed form.
If uplo = 'U', the computed factors U and D are the exact factors of a perturbed matrix A + E, where
|E| ≤ c(n)ε P|U||D||U^T|P^T,
c(n) is a modest linear function of n, and ε is the machine precision. A similar estimate holds for the computed L and D when uplo = 'L'.
The total number of floating-point operations is approximately (1/3)n^3 for real flavors or (4/3)n^3 for complex flavors.
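To make the ipiv encoding described under Output Parameters more concrete, the following C sketch walks an ipiv array and counts the 1-by-1 and 2-by-2 diagonal blocks of D. The helper name count_d_blocks and its signature are hypothetical (they are not part of any library interface). The array is assumed to hold the 1-based values returned by the routine, stored in an ordinary 0-based C array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the diagonal blocks of D encoded in ipiv.
   ipiv holds n 1-based pivot values in a 0-based C array.
   Returns 0 on success, -1 if the encoding is inconsistent. */
int count_d_blocks(char uplo, int n, const int *ipiv,
                   int *n_1x1, int *n_2x2)
{
    assert(uplo == 'U' || uplo == 'L');
    assert(n >= 0 && (n == 0 || ipiv != NULL));
    *n_1x1 = 0;
    *n_2x2 = 0;

    if (uplo == 'U') {
        /* Scan i = n, ..., 1: a 2-by-2 block occupies rows i-1 and i,
           signalled by ipiv(i) = ipiv(i-1) < 0. */
        for (int i = n; i >= 1; ) {
            if (ipiv[i - 1] > 0) {
                ++*n_1x1;
                i -= 1;
            } else {
                if (ipiv[i - 1] == 0 || i < 2 || ipiv[i - 2] != ipiv[i - 1])
                    return -1;
                ++*n_2x2;
                i -= 2;
            }
        }
    } else {
        /* Scan i = 1, ..., n: a 2-by-2 block occupies rows i and i+1,
           signalled by ipiv(i) = ipiv(i+1) < 0. */
        for (int i = 1; i <= n; ) {
            if (ipiv[i - 1] > 0) {
                ++*n_1x1;
                i += 1;
            } else {
                if (ipiv[i - 1] == 0 || i >= n || ipiv[i] != ipiv[i - 1])
                    return -1;
                ++*n_2x2;
                i += 2;
            }
        }
    }
    return 0;
}
```

For example, with uplo = 'U' and ipiv = (1, -1, -1, 4), the scan finds 1-by-1 blocks at positions 4 and 1 and one 2-by-2 block in rows/columns 2 and 3.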
After calling this routine, you can call the following routines:
?sptrs  to solve A*X = B
?spcon  to estimate the condition number of A
?sptri  to compute the inverse of A.

See Also
mkl_progress

?hptrf
Computes the Bunch-Kaufman factorization of a complex Hermitian matrix using packed storage.

Syntax
Fortran 77:
call chptrf( uplo, n, ap, ipiv, info )
call zhptrf( uplo, n, ap, ipiv, info )
Fortran 95:
call hptrf( ap [,uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_hptrf( int matrix_order, char uplo, lapack_int n, <datatype>* ap, lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the factorization of a complex Hermitian packed matrix A using the Bunch-Kaufman diagonal pivoting method:
if uplo='U', A = P*U*D*U^H*P^T
if uplo='L', A = P*L*D*L^H*P^T,
where A is the input matrix, P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is packed and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the matrix A, and A is factored as P*U*D*U^H*P^T.
If uplo = 'L', the array ap stores the lower triangular part of the matrix A, and A is factored as P*L*D*L^H*P^T.
n  INTEGER. The order of matrix A; n ≥ 0.
ap  COMPLEX for chptrf
DOUBLE COMPLEX for zhptrf.
Array, DIMENSION at least max(1, n(n+1)/2). The array ap contains the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes).

Output Parameters
ap  The upper or lower triangle of A (as specified by uplo) is overwritten by details of the block-diagonal matrix D and the multipliers used to obtain the factor U (or L).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, d_ii is 0. The factorization has been completed, but D is exactly singular. Division by 0 will occur if you use D for solving a system of linear equations.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hptrf interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
ipiv  Holds the vector of length n.
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The 2-by-2 unit diagonal blocks and the unit diagonal elements of U and L are not stored.
The remaining elements of U and L are stored in the corresponding columns of the array ap, but additional row interchanges are required to recover U or L explicitly (which is seldom necessary).
If ipiv(i) = i for all i = 1...n, then all off-diagonal elements of U (L) are stored explicitly in the corresponding elements of the array ap.
If uplo = 'U', the computed factors U and D are the exact factors of a perturbed matrix A + E, where
|E| ≤ c(n)ε P|U||D||U^T|P^T,
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for the computed L and D when uplo = 'L'.
The total number of floating-point operations is approximately (4/3)n^3.
After calling this routine, you can call the following routines:
?hptrs  to solve A*X = B
?hpcon  to estimate the condition number of A
?hptri  to compute the inverse of A.

See Also
mkl_progress

Routines for Solving Systems of Linear Equations
This section describes the LAPACK routines for solving systems of linear equations. Before calling most of these routines, you need to factorize the matrix of your system of equations (see Routines for Matrix Factorization in this chapter). However, the factorization is not necessary if your system of equations has a triangular matrix.

?getrs
Solves a system of linear equations with an LU-factored square matrix, with multiple right-hand sides.
Syntax
Fortran 77:
call sgetrs( trans, n, nrhs, a, lda, ipiv, b, ldb, info )
call dgetrs( trans, n, nrhs, a, lda, ipiv, b, ldb, info )
call cgetrs( trans, n, nrhs, a, lda, ipiv, b, ldb, info )
call zgetrs( trans, n, nrhs, a, lda, ipiv, b, ldb, info )
Fortran 95:
call getrs( a, ipiv, b [, trans] [,info] )
C:
lapack_int LAPACKE_getrs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the following systems of linear equations:
A*X = B    if trans='N',
A^T*X = B  if trans='T',
A^H*X = B  if trans='C' (for complex matrices only).
Before calling this routine, you must call ?getrf to compute the LU factorization of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans  CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations:
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.
n  INTEGER. The order of A; the number of rows in B (n ≥ 0).
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, b  REAL for sgetrs
DOUBLE PRECISION for dgetrs
COMPLEX for cgetrs
DOUBLE COMPLEX for zgetrs.
Arrays: a(lda,*), b(ldb,*).
The array a contains the LU factorization of matrix A resulting from the call of ?getrf.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of a must be at least max(1,n), the second dimension of b at least max(1,nrhs).
lda  INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb  INTEGER.
The leading dimension of b; ldb ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?getrf.

Output Parameters
b  Overwritten by the solution matrix X.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine getrs interface are as follows:
a  Holds the matrix A of size (n, n).
b  Holds the matrix B of size (n, nrhs).
ipiv  Holds the vector of length n.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
|E| ≤ c(n)ε P|L||U|,
c(n) is a modest linear function of n, and ε is the machine precision.
If x0 is the true solution, the computed solution x satisfies this error bound:
||x - x0||∞ / ||x||∞ ≤ c(n) cond(A,x) ε,
where cond(A,x) = || |A^-1| |A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).
Note that cond(A,x) can be much smaller than κ∞(A); the condition number of A^T and A^H might or might not be equal to κ∞(A).
The approximate number of floating-point operations for one right-hand side vector b is 2n^2 for real flavors and 8n^2 for complex flavors.
To estimate the condition number κ∞(A), call ?gecon.
To refine the solution and estimate the error, call ?gerfs.

?gbtrs
Solves a system of linear equations with an LU-factored band matrix, with multiple right-hand sides.
Syntax
Fortran 77:
call sgbtrs( trans, n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call dgbtrs( trans, n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call cgbtrs( trans, n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call zgbtrs( trans, n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
Fortran 95:
call gbtrs( ab, b, ipiv [, kl] [, trans] [, info] )
C:
lapack_int LAPACKE_gbtrs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const <datatype>* ab, lapack_int ldab, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the following systems of linear equations:
A*X = B    if trans='N',
A^T*X = B  if trans='T',
A^H*X = B  if trans='C' (for complex matrices only).
Here A is an LU-factored general band matrix of order n with kl nonzero subdiagonals and ku nonzero superdiagonals. Before calling this routine, call ?gbtrf to compute the LU factorization of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans  CHARACTER*1. Must be 'N' or 'T' or 'C'.
n  INTEGER. The order of A; the number of rows in B; n ≥ 0.
kl  INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku  INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
ab, b  REAL for sgbtrs
DOUBLE PRECISION for dgbtrs
COMPLEX for cgbtrs
DOUBLE COMPLEX for zgbtrs.
Arrays: ab(ldab,*), b(ldb,*).
The array ab contains the matrix A in band storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of ab must be at least max(1, n), and the second dimension of b at least max(1,nrhs).
ldab  INTEGER. The leading dimension of the array ab; ldab ≥ 2*kl + ku + 1.
ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?gbtrf.

Output Parameters
b  Overwritten by the solution matrix X.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbtrs interface are as follows:
ab  Holds the array A of size (2*kl+ku+1,n).
b  Holds the matrix B of size (n, nrhs).
ipiv  Holds the vector of length n.
kl  If omitted, assumed kl = ku.
ku  Restored as lda-2*kl-1.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
|E| ≤ c(kl + ku + 1)ε P|L||U|,
c(k) is a modest linear function of k, and ε is the machine precision.
If x0 is the true solution, the computed solution x satisfies this error bound:
||x - x0||∞ / ||x||∞ ≤ c(kl + ku + 1) cond(A,x) ε,
where cond(A,x) = || |A^-1| |A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).
Note that cond(A,x) can be much smaller than κ∞(A); the condition number of A^T and A^H might or might not be equal to κ∞(A).
The approximate number of floating-point operations for one right-hand side vector is 2n(ku + 2kl) for real flavors. The number of operations for complex flavors is 4 times greater. All these estimates assume that kl and ku are much less than n.
To estimate the condition number κ∞(A), call ?gbcon.
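As an aid to reading the ldab ≥ 2*kl + ku + 1 requirement, the sketch below maps an element A(i,j) of the band matrix to its position in the ab array. It assumes the standard LAPACK band layout used as input to ?gbtrf, ab(kl+ku+1+i-j, j) = A(i,j) for max(1, j-ku) ≤ i ≤ min(n, j+kl), which this section references through Matrix Storage Schemes but does not spell out. The helper name band_offset is hypothetical, and column-major storage is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 0-based offset into a column-major ab array of the
   element A(i,j) (1-based i and j), assuming the LAPACK band layout
   ab(kl+ku+1+i-j, j) = A(i,j). Returns -1 if (i,j) lies outside the
   matrix or outside the band. */
long band_offset(int n, int kl, int ku, int ldab, int i, int j)
{
    assert(n >= 0 && kl >= 0 && ku >= 0);
    assert(ldab >= 2 * kl + ku + 1);
    if (i < 1 || i > n || j < 1 || j > n)
        return -1;                       /* outside the matrix */
    if (i - j > kl || j - i > ku)
        return -1;                       /* outside the band */
    /* Row kl+ku+1+i-j (1-based) of column j, converted to a 0-based offset. */
    return (long)(kl + ku + i - j) + (long)(j - 1) * ldab;
}
```

For a tridiagonal matrix (kl = ku = 1) with ldab = 4, the diagonal element A(1,1) lands at offset 2, which leaves the first kl rows of each column free for the fill-in produced by the factorization.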
To refine the solution and estimate the error, call ?gbrfs.

?gttrs
Solves a system of linear equations with a tridiagonal matrix using the LU factorization computed by ?gttrf.

Syntax
Fortran 77:
call sgttrs( trans, n, nrhs, dl, d, du, du2, ipiv, b, ldb, info )
call dgttrs( trans, n, nrhs, dl, d, du, du2, ipiv, b, ldb, info )
call cgttrs( trans, n, nrhs, dl, d, du, du2, ipiv, b, ldb, info )
call zgttrs( trans, n, nrhs, dl, d, du, du2, ipiv, b, ldb, info )
Fortran 95:
call gttrs( dl, d, du, du2, b, ipiv [, trans] [,info] )
C:
lapack_int LAPACKE_gttrs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const <datatype>* dl, const <datatype>* d, const <datatype>* du, const <datatype>* du2, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the following systems of linear equations with multiple right-hand sides:
A*X = B    if trans='N',
A^T*X = B  if trans='T',
A^H*X = B  if trans='C' (for complex matrices only).
Before calling this routine, you must call ?gttrf to compute the LU factorization of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans  CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations:
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.
n  INTEGER. The order of A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides, that is, the number of columns in B; nrhs ≥ 0.
dl, d, du, du2, b  REAL for sgttrs
DOUBLE PRECISION for dgttrs
COMPLEX for cgttrs
DOUBLE COMPLEX for zgttrs.
Arrays: dl(n-1), d(n), du(n-1), du2(n-2), b(ldb,nrhs).
The array dl contains the (n - 1) multipliers that define the matrix L from the LU factorization of A.
The array d contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
The array du contains the (n - 1) elements of the first superdiagonal of U.
The array du2 contains the (n - 2) elements of the second superdiagonal of U.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION (n). The ipiv array, as returned by ?gttrf.

Output Parameters
b  Overwritten by the solution matrix X.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gttrs interface are as follows:
dl  Holds the vector of length (n-1).
d  Holds the vector of length n.
du  Holds the vector of length (n-1).
du2  Holds the vector of length (n-2).
b  Holds the matrix B of size (n, nrhs).
ipiv  Holds the vector of length n.
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
|E| ≤ c(n)ε P|L||U|,
c(n) is a modest linear function of n, and ε is the machine precision.
If x0 is the true solution, the computed solution x satisfies this error bound:
||x - x0||∞ / ||x||∞ ≤ c(n) cond(A,x) ε,
where cond(A,x) = || |A^-1| |A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).
Note that cond(A,x) can be much smaller than κ∞(A); the condition number of A^T and A^H might or might not be equal to κ∞(A).
The approximate number of floating-point operations for one right-hand side vector b is 7n (including n divisions) for real flavors and 34n (including 2n divisions) for complex flavors.
To estimate the condition number κ∞(A), call ?gtcon.
To refine the solution and estimate the error, call ?gtrfs.

?dttrsb
Solves a system of linear equations with a diagonally dominant tridiagonal matrix using the LU factorization computed by ?dttrfb.

Syntax
Fortran 77:
call sdttrsb( trans, n, nrhs, dl, d, du, b, ldb, info )
call ddttrsb( trans, n, nrhs, dl, d, du, b, ldb, info )
call cdttrsb( trans, n, nrhs, dl, d, du, b, ldb, info )
call zdttrsb( trans, n, nrhs, dl, d, du, b, ldb, info )

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description
The ?dttrsb routine solves the following systems of linear equations with multiple right-hand sides for X:
A*X = B if trans='N',
A^T*X = B if trans='T',
A^H*X = B if trans='C' (for complex matrices only).
Before calling this routine, call ?dttrfb to compute the factorization of A.

Input Parameters
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations solved for X:
If trans = 'N', then A*X = B.
If trans = 'T', then A^T*X = B.
If trans = 'C', then A^H*X = B.
n INTEGER. The order of A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns in B; nrhs ≥ 0.
dl, d, du, b REAL for sdttrsb
DOUBLE PRECISION for ddttrsb
COMPLEX for cdttrsb
DOUBLE COMPLEX for zdttrsb.
Arrays: dl(n - 1), d(n), du(n - 1), b(ldb,nrhs).
The array dl contains the (n - 1) multipliers that define the matrices L1, L2 from the factorization of A.
The array d contains the n diagonal elements of the upper triangular matrix U from the factorization of A.
The array du contains the (n - 1) elements of the superdiagonal of U.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?potrs
Solves a system of linear equations with a Cholesky-factored symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call spotrs( uplo, n, nrhs, a, lda, b, ldb, info )
call dpotrs( uplo, n, nrhs, a, lda, b, ldb, info )
call cpotrs( uplo, n, nrhs, a, lda, b, ldb, info )
call zpotrs( uplo, n, nrhs, a, lda, b, ldb, info )
Fortran 95:
call potrs( a, b [,uplo] [, info] )
C:
lapack_int LAPACKE_<?>potrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A, given the Cholesky factorization of A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular. The system is solved with multiple right-hand sides stored in the columns of the matrix B.
Before calling this routine, you must call ?potrf to compute the Cholesky factorization of A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER.
The number of right-hand sides (nrhs ≥ 0).
a, b REAL for spotrs
DOUBLE PRECISION for dpotrs
COMPLEX for cpotrs
DOUBLE COMPLEX for zpotrs.
Arrays: a(lda,*), b(ldb,*).
The array a contains the factor U or L (see uplo).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of a must be at least max(1,n), the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine potrs interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If uplo = 'U', the computed solution for each right-hand side b is the exact solution of a perturbed system of equations (A + E)x = b, where
|E| ≤ c(n)ε |U^H||U|
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for uplo = 'L'.
If x0 is the true solution, the computed solution x satisfies this error bound:
||x - x0||∞ / ||x||∞ ≤ c(n) cond(A,x) ε
where cond(A,x) = || |A^-1||A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).
Note that cond(A,x) can be much smaller than κ∞(A).
The approximate number of floating-point operations for one right-hand side vector b is 2n^2 for real flavors and 8n^2 for complex flavors.
To estimate the condition number κ∞(A), call ?pocon.
To refine the solution and estimate the error, call ?porfs.
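
The two triangular solves that a ?potrs-style routine performs for uplo = 'U' can be sketched in plain C as follows. This is only an illustration of the arithmetic, not the MKL implementation: the function name chol_upper_solve is hypothetical, it handles a single real right-hand side, and it assumes the factor U is stored column-major with leading dimension lda, as in the Fortran interface.

```c
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): solves A*x = b in place,
   given the upper Cholesky factor U with A = U^T*U.
   u is column-major with leading dimension lda; only the upper
   triangle is read. b holds the right-hand side on entry and the
   solution on exit. */
void chol_upper_solve(int n, const double *u, int lda, double *b)
{
    /* Forward substitution: U^T * y = b (U^T is lower triangular). */
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int j = 0; j < i; ++j)
            s -= u[(size_t)j + (size_t)i * lda] * b[j];   /* U(j,i) */
        b[i] = s / u[(size_t)i + (size_t)i * lda];        /* U(i,i) */
    }
    /* Back substitution: U * x = y. */
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        for (int j = i + 1; j < n; ++j)
            s -= u[(size_t)i + (size_t)j * lda] * b[j];   /* U(i,j) */
        b[i] = s / u[(size_t)i + (size_t)i * lda];        /* U(i,i) */
    }
}
```

For example, with U = [2 1; 0 2] (so that A = [4 2; 2 5]) and b = (8, 12), the sketch overwrites b with (1, 2). Each substitution touches about n^2/2 multiply-add pairs, which is consistent with the 2n^2 operation count quoted above for real flavors.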

?pftrs
Solves a system of linear equations with a Cholesky-factored symmetric (Hermitian) positive-definite matrix using the Rectangular Full Packed (RFP) format.

Syntax
Fortran 77:
call spftrs( transr, uplo, n, nrhs, a, b, ldb, info )
call dpftrs( transr, uplo, n, nrhs, a, b, ldb, info )
call cpftrs( transr, uplo, n, nrhs, a, b, ldb, info )
call zpftrs( transr, uplo, n, nrhs, a, b, ldb, info )
C:
lapack_int LAPACKE_<?>pftrs( int matrix_order, char transr, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves a system of linear equations A*X = B with a symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A using the Cholesky factorization of A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
computed by ?pftrf. L stands for a lower triangular matrix and U for an upper triangular matrix.
The matrix A is in the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1. Must be 'N', 'T' (for real data) or 'C' (for complex data).
If transr = 'N', the normal form of RFP A is stored.
If transr = 'T', the transpose form of RFP A is stored.
If transr = 'C', the conjugate-transpose form of RFP A is stored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of the RFP matrix A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B; nrhs ≥ 0.
a, b REAL for spftrs
DOUBLE PRECISION for dpftrs
COMPLEX for cpftrs
DOUBLE COMPLEX for zpftrs.
Arrays: a(n*(n+1)/2), b(ldb,nrhs).
The array a contains the matrix A in the RFP format.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b The solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

?pptrs
Solves a system of linear equations with a packed Cholesky-factored symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call spptrs( uplo, n, nrhs, ap, b, ldb, info )
call dpptrs( uplo, n, nrhs, ap, b, ldb, info )
call cpptrs( uplo, n, nrhs, ap, b, ldb, info )
call zpptrs( uplo, n, nrhs, ap, b, ldb, info )
Fortran 95:
call pptrs( ap, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>pptrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* ap, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a packed symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A, given the Cholesky factorization of A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular. The system is solved with multiple right-hand sides stored in the columns of the matrix B.
Before calling this routine, you must call ?pptrf to compute the Cholesky factorization of A.
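
As an illustration of how a packed factor is addressed during the solve, the following plain C sketch handles the uplo = 'L' case for one real right-hand side. It is not the MKL implementation. The names packed_lower_index and chol_packed_lower_solve are hypothetical, and the index formula assumes the usual LAPACK column-major packed layout, in which the columns of the lower triangle are stored one after another.

```c
#include <stddef.h>

/* Hypothetical helper: 0-based position of L(i,j), i >= j, in
   column-major packed lower storage of an n-by-n triangle.
   Assumes the standard LAPACK packed layout. */
size_t packed_lower_index(int n, int i, int j)
{
    return (size_t)i + (size_t)j * (size_t)(2 * n - j - 1) / 2;
}

/* Hypothetical helper (not an MKL routine): solves A*x = b in place,
   given the packed lower Cholesky factor L with A = L*L^T. */
void chol_packed_lower_solve(int n, const double *ap, double *b)
{
    /* Forward substitution: L * y = b. */
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int j = 0; j < i; ++j)
            s -= ap[packed_lower_index(n, i, j)] * b[j];   /* L(i,j) */
        b[i] = s / ap[packed_lower_index(n, i, i)];
    }
    /* Back substitution: L^T * x = y. */
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        for (int j = i + 1; j < n; ++j)
            s -= ap[packed_lower_index(n, j, i)] * b[j];   /* L(j,i) */
        b[i] = s / ap[packed_lower_index(n, i, i)];
    }
}
```

With L = [2 0; 1 2] packed as ap = (2, 1, 2) and b = (8, 12), the sketch overwrites b with (1, 2). The packed array holds n(n+1)/2 elements, which matches the minimum dimension required for ap.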

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides (nrhs ≥ 0).
ap, b REAL for spptrs
DOUBLE PRECISION for dpptrs
COMPLEX for cpptrs
DOUBLE COMPLEX for zpptrs.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2).
The array ap contains the factor U or L, as specified by uplo, in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pptrs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If uplo = 'U', the computed solution for each right-hand side b is the exact solution of a perturbed system of equations (A + E)x = b, where
|E| ≤ c(n)ε |U^H||U|
c(n) is a modest linear function of n, and ε is the machine precision.
A similar estimate holds for uplo = 'L'.
If x0 is the true solution, the computed solution x satisfies this error bound:
||x - x0||∞ / ||x||∞ ≤ c(n) cond(A,x) ε
where cond(A,x) = || |A^-1||A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).
Note that cond(A,x) can be much smaller than κ∞(A).
The approximate number of floating-point operations for one right-hand side vector b is 2n^2 for real flavors and 8n^2 for complex flavors.
To estimate the condition number κ∞(A), call ?ppcon.
To refine the solution and estimate the error, call ?pprfs.

?pbtrs
Solves a system of linear equations with a Cholesky-factored symmetric (Hermitian) positive-definite band matrix.

Syntax
Fortran 77:
call spbtrs( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call dpbtrs( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call cpbtrs( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call zpbtrs( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
Fortran 95:
call pbtrs( ab, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>pbtrs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const <datatype>* ab, lapack_int ldab, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X a system of linear equations A*X = B with a symmetric positive-definite or, for complex data, Hermitian positive-definite band matrix A, given the Cholesky factorization of A:
A = U^T*U for real data, A = U^H*U for complex data if uplo='U'
A = L*L^T for real data, A = L*L^H for complex data if uplo='L'
where L is a lower triangular matrix and U is upper triangular. The system is solved with multiple right-hand sides stored in the columns of the matrix B.
Before calling this routine, you must call ?pbtrf to compute the Cholesky factorization of A in the band storage form.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored:
If uplo = 'U', the upper triangular factor is stored in ab.
If uplo = 'L', the lower triangular factor is stored in ab.
n INTEGER. The order of matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ab, b REAL for spbtrs
DOUBLE PRECISION for dpbtrs
COMPLEX for cpbtrs
DOUBLE COMPLEX for zpbtrs.
Arrays: ab(ldab,*), b(ldb,*).
The array ab contains the Cholesky factor, as returned by the factorization routine, in band storage form.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of ab must be at least max(1, n), and the second dimension of b at least max(1,nrhs).
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kd + 1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbtrs interface are as follows:
ab Holds the array A of size (kd+1,n).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
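
To show how the band storage form enters the solve, here is a plain C sketch for uplo = 'U' and a single real right-hand side. It is an illustration only, not the MKL implementation. The name chol_band_upper_solve is hypothetical, and the indexing assumes the usual LAPACK band layout for an upper factor, where U(i,j) is kept in row kd + i - j of column j of ab (0-based, column-major).

```c
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): solves A*x = b in place,
   given the upper band Cholesky factor U with A = U^T*U and kd
   superdiagonals. ab is column-major with leading dimension
   ldab >= kd + 1; U(i,j) is read from ab[(kd + i - j) + j*ldab]. */
void chol_band_upper_solve(int n, int kd, const double *ab, int ldab,
                           double *b)
{
    /* Forward substitution: U^T * y = b, using only entries in the band. */
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        int j0 = (i - kd > 0) ? i - kd : 0;
        for (int j = j0; j < i; ++j)
            s -= ab[(size_t)(kd + j - i) + (size_t)i * ldab] * b[j]; /* U(j,i) */
        b[i] = s / ab[(size_t)kd + (size_t)i * ldab];                /* U(i,i) */
    }
    /* Back substitution: U * x = y. */
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        int j1 = (i + kd < n - 1) ? i + kd : n - 1;
        for (int j = i + 1; j <= j1; ++j)
            s -= ab[(size_t)(kd + i - j) + (size_t)j * ldab] * b[j]; /* U(i,j) */
        b[i] = s / ab[(size_t)kd + (size_t)i * ldab];                /* U(i,i) */
    }
}
```

For a factor with diagonal (2, 2, 2) and superdiagonal (1, 1), stored with kd = 1 and ldab = 2 as ab = (0, 2, 1, 2, 1, 2), and b = (6, 9, 7), the sketch overwrites b with (1, 1, 1). Each inner loop runs over at most kd entries, so the work grows with n*kd rather than n^2.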

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
|E| ≤ c(kd + 1)ε P|U^H||U| or |E| ≤ c(kd + 1)ε P|L^H||L|
c(k) is a modest linear function of k, and ε is the machine precision.
If x0 is the true solution, the computed solution x satisfies this error bound:
||x - x0||∞ / ||x||∞ ≤ c(kd + 1) cond(A,x) ε
where cond(A,x) = || |A^-1||A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).
Note that cond(A,x) can be much smaller than κ∞(A).
The approximate number of floating-point operations for one right-hand side vector is 4n*kd for real flavors and 16n*kd for complex flavors.
To estimate the condition number κ∞(A), call ?pbcon.
To refine the solution and estimate the error, call ?pbrfs.

?pttrs
Solves a system of linear equations with a symmetric (Hermitian) positive-definite tridiagonal matrix using the factorization computed by ?pttrf.

Syntax
Fortran 77:
call spttrs( n, nrhs, d, e, b, ldb, info )
call dpttrs( n, nrhs, d, e, b, ldb, info )
call cpttrs( uplo, n, nrhs, d, e, b, ldb, info )
call zpttrs( uplo, n, nrhs, d, e, b, ldb, info )
Fortran 95:
call pttrs( d, e, b [,info] )
call pttrs( d, e, b [,uplo] [,info] )
C:
lapack_int LAPACKE_spttrs( int matrix_order, lapack_int n, lapack_int nrhs, const float* d, const float* e, float* b, lapack_int ldb );
lapack_int LAPACKE_dpttrs( int matrix_order, lapack_int n, lapack_int nrhs, const double* d, const double* e, double* b, lapack_int ldb );
lapack_int LAPACKE_cpttrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* d, const lapack_complex_float* e, lapack_complex_float* b, lapack_int ldb );
lapack_int LAPACKE_zpttrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* d, const lapack_complex_double* e, lapack_complex_double* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X a system of
linear equations A*X = B with a symmetric (Hermitian) positive-definite tridiagonal matrix A. Before calling this routine, call ?pttrf to compute the L*D*L' factorization of A for real data, or the L*D*L' or U'*D*U factorization of A for complex data.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Used for cpttrs/zpttrs only. Must be 'U' or 'L'. Specifies whether the superdiagonal or the subdiagonal of the tridiagonal matrix A is stored and how A is factored:
If uplo = 'U', the array e stores the superdiagonal of A, and A is factored as U'*D*U.
If uplo = 'L', the array e stores the subdiagonal of A, and A is factored as L*D*L'.
n INTEGER. The order of A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B; nrhs ≥ 0.
d REAL for spttrs, cpttrs
DOUBLE PRECISION for dpttrs, zpttrs.
Array, dimension (n). Contains the diagonal elements of the diagonal matrix D from the factorization computed by ?pttrf.
e, b REAL for spttrs
DOUBLE PRECISION for dpttrs
COMPLEX for cpttrs
DOUBLE COMPLEX for zpttrs.
Arrays: e(n - 1), b(ldb, nrhs).
The array e contains the (n - 1) off-diagonal elements of the unit bidiagonal factor U or L from the factorization computed by ?pttrf (see uplo).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pttrs interface are as follows:
d Holds the vector of length n.
e Holds the vector of length (n-1).
b Holds the matrix B of size (n, nrhs).
uplo Used in complex flavors only. Must be 'U' or 'L'. The default value is 'U'.

?sytrs
Solves a system of linear equations with a UDU- or LDL-factored symmetric matrix.

Syntax
Fortran 77:
call ssytrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
call dsytrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
call csytrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
call zsytrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
Fortran 95:
call sytrs( a, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sytrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a symmetric matrix A, given the Bunch-Kaufman factorization of A:
if uplo='U', A = P*U*D*U^T*P^T
if uplo='L', A = P*L*D*L^T*P^T,
where P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a symmetric block-diagonal matrix. The system is solved with multiple right-hand sides stored in the columns of the matrix B. You must supply to this routine the factor U (or L) and the array ipiv returned by the factorization routine ?sytrf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = P*U*D*U^T*P^T.
If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sytrf.
a, b REAL for ssytrs
DOUBLE PRECISION for dsytrs
COMPLEX for csytrs
DOUBLE COMPLEX for zsytrs.
Arrays: a(lda,*), b(ldb,*).
The array a contains the factor U or L (see uplo).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations.
The second dimension of a must be at least max(1,n), and the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sytrs interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
|E| ≤ c(n)ε P|U||D||U^T|P^T or |E| ≤ c(n)ε P|L||D||L^T|P^T
c(n) is a modest linear function of n, and ε is the machine precision.
If x0 is the true solution, the computed solution x satisfies this error bound:
||x - x0||∞ / ||x||∞ ≤ c(n) cond(A,x) ε
where cond(A,x) = || |A^-1||A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).
Note that cond(A,x) can be much smaller than κ∞(A).
The total number of floating-point operations for one right-hand side vector is approximately 2n^2 for real flavors or 8n^2 for complex flavors.
To estimate the condition number κ∞(A), call ?sycon.
To refine the solution and estimate the error, call ?syrfs.

?hetrs
Solves a system of linear equations with a UDU- or LDL-factored Hermitian matrix.

Syntax
Fortran 77:
call chetrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
call zhetrs( uplo, n, nrhs, a, lda, ipiv, b, ldb, info )
Fortran 95:
call hetrs( a, b, ipiv [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>hetrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B with a Hermitian matrix A, given the Bunch-Kaufman factorization of A:
if uplo = 'U', A = P*U*D*U^H*P^T
if uplo = 'L', A = P*L*D*L^H*P^T,
where P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix. The system is solved with multiple right-hand sides stored in the columns of the matrix B. You must supply to this routine the factor U (or L) and the array ipiv returned by the factorization routine ?hetrf.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = P*U*D*U^H*P^T.
If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hetrf.
a, b COMPLEX for chetrs
DOUBLE COMPLEX for zhetrs.
Arrays: a(lda,*), b(ldb,*).
The array a contains the factor U or L (see uplo).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations.
The second dimension of a must be at least max(1,n), the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetrs interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
|E| ≤ c(n)ε P|U||D||U^H|P^T or |E| ≤ c(n)ε P|L||D||L^H|P^T
c(n) is a modest linear function of n, and ε is the machine precision.
If x0 is the true solution, the computed solution x satisfies this error bound:
||x - x0||∞ / ||x||∞ ≤ c(n) cond(A,x) ε
where cond(A,x) = || |A^-1||A| |x| ||∞ / ||x||∞ ≤ ||A^-1||∞ ||A||∞ = κ∞(A).
Note that cond(A,x) can be much smaller than κ∞(A).
The total number of floating-point operations for one right-hand side vector is approximately 8n^2.
To estimate the condition number κ∞(A), call ?hecon.
To refine the solution and estimate the error, call ?herfs.

?sytrs2
Solves a system of linear equations with a UDU- or LDL-factored symmetric matrix computed by ?sytrf and converted by ?syconv.

Syntax
Fortran 77:
call ssytrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
call dsytrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
call csytrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
call zsytrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
Fortran 95:
call sytrs2( a, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sytrs2( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves a system of linear equations A*X = B with a symmetric matrix A using the factorization of A:
if uplo='U', A = U*D*U^T
if uplo='L', A = L*D*L^T
where
• U and L are upper and lower triangular matrices with unit diagonal
• D is a symmetric block-diagonal matrix.
The factorization is computed by ?sytrf and converted by ?syconv.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = U*D*U^T.
If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = L*D*L^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, b REAL for ssytrs2
DOUBLE PRECISION for dsytrs2
COMPLEX for csytrs2
DOUBLE COMPLEX for zsytrs2.
Arrays: a(lda,*), b(ldb,*).
The array a contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?sytrf.
The array b contains the right-hand side matrix B.
The second dimension of a must be at least max(1,n), and the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array of DIMENSION n. The ipiv array contains details of the interchanges and the block structure of D as determined by ?sytrf.
work REAL for ssytrs2
DOUBLE PRECISION for dsytrs2
COMPLEX for csytrs2
DOUBLE COMPLEX for zsytrs2.
Workspace array, DIMENSION n.

Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sytrs2 interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Indicates how the input matrix A has been factored. Must be 'U' or 'L'.

See Also
?sytrf
?syconv

?hetrs2
Solves a system of linear equations with a UDU- or LDL-factored Hermitian matrix computed by ?hetrf and converted by ?syconv.

Syntax
Fortran 77:
call chetrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
call zhetrs2( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, info )
Fortran 95:
call hetrs2( a, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>hetrs2( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves a system of linear equations A*X = B with a complex Hermitian matrix A using the factorization of A:
if uplo='U', A = U*D*U^H
if uplo='L', A = L*D*L^H
where
• U and L are upper and lower triangular matrices with unit diagonal
• D is a Hermitian block-diagonal matrix.
The factorization is computed by ?hetrf and converted by ?syconv.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored:
If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = U*D*U^H.
If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = L*D*L^H.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, b COMPLEX for chetrs2
DOUBLE COMPLEX for zhetrs2.
Arrays: a(lda,*), b(ldb,*).
The array a contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?hetrf.
The array b contains the right-hand side matrix B.
The second dimension of a must be at least max(1,n), and the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array of DIMENSION n. The ipiv array contains details of the interchanges and the block structure of D as determined by ?hetrf.
work COMPLEX for chetrs2
DOUBLE COMPLEX for zhetrs2
Workspace array, DIMENSION n.
Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetrs2 interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
See Also
?hetrf
?syconv
?sptrs
Solves a system of linear equations with a UDU- or LDL-factored symmetric matrix using packed storage.
Syntax
Fortran 77:
call ssptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call dsptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call csptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call zsptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
Fortran 95:
call sptrs( ap, b, ipiv [, uplo] [,info] )
C:
lapack_int LAPACKE_<?>sptrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* ap, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the system of linear equations A*X = B with a symmetric matrix A, given the Bunch-Kaufman factorization of A:
if uplo='U', A = P*U*D*U^T*P^T
if uplo='L', A = P*L*D*L^T*P^T,
where P is a permutation matrix, U and L are upper and lower packed triangular matrices with unit diagonal, and D is a symmetric block-diagonal matrix. The system is solved with multiple right-hand sides stored in the columns of the matrix B. You must supply the factor U (or L) and the array ipiv returned by the factorization routine ?sptrf.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array ap stores the packed factor U of the factorization A = P*U*D*U^T*P^T.
If uplo = 'L', the array ap stores the packed factor L of the factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sptrf.
ap, b REAL for ssptrs
DOUBLE PRECISION for dsptrs
COMPLEX for csptrs
DOUBLE COMPLEX for zsptrs.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2).
The array ap contains the factor U or L, as specified by uplo, in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations. The second dimension of b must be at least max(1, nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info=0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sptrs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\, P|U||D||U^T|P^T$ or $|E| \le c(n)\,\varepsilon\, P|L||D||L^T|P^T$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The total number of floating-point operations for one right-hand side vector is approximately 2n^2 for real flavors or 8n^2 for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?spcon.
To refine the solution and estimate the error, call ?sprfs.
?hptrs
Solves a system of linear equations with a UDU- or LDL-factored Hermitian matrix using packed storage.
Syntax
Fortran 77:
call chptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call zhptrs( uplo, n, nrhs, ap, ipiv, b, ldb, info )
Fortran 95:
call hptrs( ap, b, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>hptrs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const <datatype>* ap, const lapack_int* ipiv, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the system of linear equations A*X = B with a Hermitian matrix A, given the Bunch-Kaufman factorization of A:
if uplo='U', A = P*U*D*U^H*P^T
if uplo='L', A = P*L*D*L^H*P^T,
where P is a permutation matrix, U and L are upper and lower packed triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix. The system is solved with multiple right-hand sides stored in the columns of the matrix B.
You must supply to this routine the arrays ap (containing U or L) and ipiv in the form returned by the factorization routine ?hptrf.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array ap stores the packed factor U of the factorization A = P*U*D*U^H*P^T.
If uplo = 'L', the array ap stores the packed factor L of the factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hptrf.
ap, b COMPLEX for chptrs
DOUBLE COMPLEX for zhptrs.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2).
The array ap contains the factor U or L, as specified by uplo, in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations. The second dimension of b must be at least max(1, nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hptrs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\, P|U||D||U^H|P^T$ or $|E| \le c(n)\,\varepsilon\, P|L||D||L^H|P^T$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$.
The total number of floating-point operations for one right-hand side vector is approximately 8n^2 for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?hpcon.
To refine the solution and estimate the error, call ?hprfs.
?trtrs
Solves a system of linear equations with a triangular matrix, with multiple right-hand sides.
Syntax
Fortran 77:
call strtrs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, info )
call dtrtrs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, info )
call ctrtrs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, info )
call ztrtrs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, info )
Fortran 95:
call trtrs( a, b [,uplo] [, trans] [,diag] [,info] )
C:
lapack_int LAPACKE_<?>trtrs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the following systems of linear equations with a triangular matrix A, with multiple right-hand sides stored in B:
A*X = B if trans='N',
A^T*X = B if trans='T',
A^H*X = B if trans='C' (for complex matrices only).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array a.
n INTEGER. The order of A; the number of rows in B; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, b REAL for strtrs
DOUBLE PRECISION for dtrtrs
COMPLEX for ctrtrs
DOUBLE COMPLEX for ztrtrs.
Arrays: a(lda,*), b(ldb,*).
The array a contains the matrix A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of a must be at least max(1,n), the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info=0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trtrs interface are as follows:
a Holds the matrix A of size (n, n).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\,|A|$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$, provided $c(n)\,\mathrm{cond}(A,x)\,\varepsilon < 1$,
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$; the condition number of A^T and A^H might or might not be equal to $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector b is n^2 for real flavors and 4n^2 for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?trcon.
To estimate the error in the solution, call ?trrfs.
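As an illustration of what a triangular solve amounts to in its simplest case (uplo = 'U', trans = 'N', a single right-hand side, double precision), the following C sketch performs back substitution on a column-major array. The function name upper_tri_solve, its unit_diag flag, and its return convention are assumptions made for this example only; it is not an MKL routine and says nothing about how MKL implements ?trtrs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of MKL): solves A*x = b for an upper
   triangular A of order n stored column-major with leading dimension lda.
   b is overwritten by x, in the same way ?trtrs overwrites B with X.
   unit_diag != 0 plays the role of diag = 'U': the diagonal is taken to
   be 1 and is never read from the array a.
   Returns 0 on success, or the 1-based index of a zero diagonal element
   (a convention chosen for this sketch). */
int upper_tri_solve(int n, const double *a, int lda, double *b, int unit_diag)
{
    /* Same constraints as in the parameter list: n >= 0, lda >= max(1, n). */
    assert(n >= 0 && lda >= (n > 1 ? n : 1));

    /* Check the diagonal first so that b is untouched on failure. */
    if (!unit_diag) {
        for (int j = 0; j < n; ++j)
            if (a[j + (size_t)j * lda] == 0.0)
                return j + 1;
    }

    /* Back substitution, processing one column of A at a time. */
    for (int j = n - 1; j >= 0; --j) {
        if (!unit_diag)
            b[j] /= a[j + (size_t)j * lda];
        for (int i = 0; i < j; ++i)
            b[i] -= b[j] * a[i + (size_t)j * lda];
    }
    return 0;
}
```

The two nested loops perform roughly n^2/2 multiplications and n^2/2 additions, which agrees with the operation count of about n^2 quoted above for real flavors.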
?tptrs
Solves a system of linear equations with a packed triangular matrix, with multiple right-hand sides.
Syntax
Fortran 77:
call stptrs( uplo, trans, diag, n, nrhs, ap, b, ldb, info )
call dtptrs( uplo, trans, diag, n, nrhs, ap, b, ldb, info )
call ctptrs( uplo, trans, diag, n, nrhs, ap, b, ldb, info )
call ztptrs( uplo, trans, diag, n, nrhs, ap, b, ldb, info )
Fortran 95:
call tptrs( ap, b [,uplo] [, trans] [,diag] [,info] )
C:
lapack_int LAPACKE_<?>tptrs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const <datatype>* ap, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the following systems of linear equations with a packed triangular matrix A, with multiple right-hand sides stored in B:
A*X = B if trans='N',
A^T*X = B if trans='T',
A^H*X = B if trans='C' (for complex matrices only).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array ap.
n INTEGER. The order of A; the number of rows in B; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ap, b REAL for stptrs
DOUBLE PRECISION for dtptrs
COMPLEX for ctptrs
DOUBLE COMPLEX for ztptrs.
Arrays: ap(*), b(ldb,*).
The dimension of ap must be at least max(1,n(n+1)/2).
The array ap contains the matrix A in packed storage (see Matrix Storage Schemes).
The array b contains the matrix B whose columns are the right-hand sides for the system of equations. The second dimension of b must be at least max(1, nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info=0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tptrs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\,|A|$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$, provided $c(n)\,\mathrm{cond}(A,x)\,\varepsilon < 1$,
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$; the condition number of A^T and A^H might or might not be equal to $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector b is n^2 for real flavors and 4n^2 for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?tpcon.
To estimate the error in the solution, call ?tprfs.
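The packed array ap holds only the n(n+1)/2 elements of one triangle. The C sketch below assumes the conventional LAPACK layout, in which the columns of the stored triangle are laid out one after another; Matrix Storage Schemes remains the authoritative description. The helper names packed_upper_index, packed_lower_index, and packed_upper_solve are hypothetical, use zero-based indices, and are not MKL routines.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based offset in ap of element (i, j), i <= j, of an upper
   triangle packed by columns (layout assumed for this sketch). */
size_t packed_upper_index(size_t i, size_t j)
{
    return i + j * (j + 1) / 2;
}

/* Zero-based offset in ap of element (i, j), i >= j, of a lower
   triangle of order n packed by columns. */
size_t packed_lower_index(size_t i, size_t j, size_t n)
{
    return i + j * (2 * n - j - 1) / 2;
}

/* Hypothetical helper: back substitution for A*x = b with an upper
   triangular, non-unit A held in packed form. b is overwritten by x.
   Returns 0 on success, or the 1-based index of a zero diagonal element
   (a convention chosen for this sketch). */
int packed_upper_solve(int n, const double *ap, double *b)
{
    assert(n >= 0);

    /* Check the diagonal first so that b is untouched on failure. */
    for (int j = 0; j < n; ++j)
        if (ap[packed_upper_index((size_t)j, (size_t)j)] == 0.0)
            return j + 1;

    for (int j = n - 1; j >= 0; --j) {
        b[j] /= ap[packed_upper_index((size_t)j, (size_t)j)];
        for (int i = 0; i < j; ++i)
            b[i] -= b[j] * ap[packed_upper_index((size_t)i, (size_t)j)];
    }
    return 0;
}
```

For n = 3 the upper-triangle offsets run a(0,0)=0, a(0,1)=1, a(1,1)=2, a(0,2)=3, a(1,2)=4, a(2,2)=5, which fills exactly n(n+1)/2 = 6 slots.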
?tbtrs
Solves a system of linear equations with a band triangular matrix, with multiple right-hand sides.
Syntax
Fortran 77:
call stbtrs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, info )
call dtbtrs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, info )
call ctbtrs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, info )
call ztbtrs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, info )
Fortran 95:
call tbtrs( ab, b [,uplo] [, trans] [,diag] [,info] )
C:
lapack_int LAPACKE_<?>tbtrs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd, lapack_int nrhs, const <datatype>* ab, lapack_int ldab, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the following systems of linear equations with a band triangular matrix A, with multiple right-hand sides stored in B:
A*X = B if trans='N',
A^T*X = B if trans='T',
A^H*X = B if trans='C' (for complex matrices only).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'.
If trans = 'N', then A*X = B is solved for X.
If trans = 'T', then A^T*X = B is solved for X.
If trans = 'C', then A^H*X = B is solved for X.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array ab.
n INTEGER. The order of A; the number of rows in B; n ≥ 0.
kd INTEGER.
The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ab, b REAL for stbtrs
DOUBLE PRECISION for dtbtrs
COMPLEX for ctbtrs
DOUBLE COMPLEX for ztbtrs.
Arrays: ab(ldab,*), b(ldb,*).
The array ab contains the matrix A in band storage form.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of ab must be at least max(1, n), the second dimension of b at least max(1,nrhs).
ldab INTEGER. The leading dimension of ab; ldab ≥ kd + 1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
b Overwritten by the solution matrix X.
info INTEGER. If info=0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tbtrs interface are as follows:
ab Holds the array A of size (kd+1,n).
b Holds the matrix B of size (n, nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
Application Notes
For each right-hand side b, the computed solution is the exact solution of a perturbed system of equations (A + E)x = b, where
$|E| \le c(n)\,\varepsilon\,|A|$,
c(n) is a modest linear function of n, and $\varepsilon$ is the machine precision.
If $x_0$ is the true solution, the computed solution x satisfies this error bound:
$\frac{\|x - x_0\|_\infty}{\|x\|_\infty} \le c(n)\,\mathrm{cond}(A,x)\,\varepsilon$, provided $c(n)\,\mathrm{cond}(A,x)\,\varepsilon < 1$,
where $\mathrm{cond}(A,x) = \|\,|A^{-1}||A||x|\,\|_\infty / \|x\|_\infty \le \|A^{-1}\|_\infty \|A\|_\infty = \kappa_\infty(A)$.
Note that cond(A,x) can be much smaller than $\kappa_\infty(A)$; the condition number of A^T and A^H might or might not be equal to $\kappa_\infty(A)$.
The approximate number of floating-point operations for one right-hand side vector b is 2n*kd for real flavors and 8n*kd for complex flavors.
To estimate the condition number $\kappa_\infty(A)$, call ?tbcon.
To estimate the error in the solution, call ?tbrfs.
Routines for Estimating the Condition Number
This section describes the LAPACK routines for estimating the condition number of a matrix. The condition number is used for analyzing the errors in the solution of a system of linear equations (see Error Analysis). Since the condition number may be arbitrarily large when the matrix is nearly singular, the routines actually compute the reciprocal condition number.
?gecon
Estimates the reciprocal of the condition number of a general matrix in the 1-norm or the infinity-norm.
Syntax
Fortran 77:
call sgecon( norm, n, a, lda, anorm, rcond, work, iwork, info )
call dgecon( norm, n, a, lda, anorm, rcond, work, iwork, info )
call cgecon( norm, n, a, lda, anorm, rcond, work, rwork, info )
call zgecon( norm, n, a, lda, anorm, rcond, work, rwork, info )
Fortran 95:
call gecon( a, anorm, rcond [,norm] [,info] )
C:
lapack_int LAPACKE_sgecon( int matrix_order, char norm, lapack_int n, const float* a, lapack_int lda, float anorm, float* rcond );
lapack_int LAPACKE_dgecon( int matrix_order, char norm, lapack_int n, const double* a, lapack_int lda, double anorm, double* rcond );
lapack_int LAPACKE_cgecon( int matrix_order, char norm, lapack_int n, const lapack_complex_float* a, lapack_int lda, float anorm, float* rcond );
lapack_int LAPACKE_zgecon( int matrix_order, char norm, lapack_int n, const lapack_complex_double* a, lapack_int lda, double anorm, double* rcond );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine estimates the reciprocal of the condition number of a general matrix A in the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?getrf to compute the LU factorization of A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for sgecon
DOUBLE PRECISION for dgecon
COMPLEX for cgecon
DOUBLE COMPLEX for zgecon.
Arrays: a(lda,*), work(*).
The array a contains the LU-factored matrix A, as returned by ?getrf. The second dimension of a must be at least max(1,n).
The array work is a workspace for the routine. The dimension of work must be at least max(1, 4*n) for real flavors and max(1, 2*n) for complex flavors.
anorm REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
The norm of the original matrix A (see Description).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cgecon
DOUBLE PRECISION for zgecon.
Workspace array, DIMENSION at least max(1, 2*n).
Output Parameters
rcond REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER.
If info=0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gecon interface are as follows:
a Holds the matrix A of size (n, n).
norm Must be '1', 'O', or 'I'. The default value is '1'.
Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b or A^H*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2*n^2 floating-point operations for real flavors and 8*n^2 for complex flavors.
?gbcon
Estimates the reciprocal of the condition number of a band matrix in the 1-norm or the infinity-norm.
Syntax
Fortran 77:
call sgbcon( norm, n, kl, ku, ab, ldab, ipiv, anorm, rcond, work, iwork, info )
call dgbcon( norm, n, kl, ku, ab, ldab, ipiv, anorm, rcond, work, iwork, info )
call cgbcon( norm, n, kl, ku, ab, ldab, ipiv, anorm, rcond, work, rwork, info )
call zgbcon( norm, n, kl, ku, ab, ldab, ipiv, anorm, rcond, work, rwork, info )
Fortran 95:
call gbcon( ab, ipiv, anorm, rcond [,kl] [,norm] [,info] )
C:
lapack_int LAPACKE_sgbcon( int matrix_order, char norm, lapack_int n, lapack_int kl, lapack_int ku, const float* ab, lapack_int ldab, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_dgbcon( int matrix_order, char norm, lapack_int n, lapack_int kl, lapack_int ku, const double* ab, lapack_int ldab, const lapack_int* ipiv, double anorm, double* rcond );
lapack_int LAPACKE_cgbcon( int matrix_order, char norm, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_float* ab, lapack_int ldab, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zgbcon( int matrix_order, char norm, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_double* ab, lapack_int ldab, const lapack_int* ipiv, double anorm, double* rcond );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine estimates the reciprocal of the condition number of a general band matrix A in the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?gbtrf to compute the LU factorization of A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
n INTEGER. The order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ldab INTEGER. The leading dimension of the array ab. (ldab ≥ 2*kl + ku + 1).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?gbtrf.
ab, work REAL for sgbcon
DOUBLE PRECISION for dgbcon
COMPLEX for cgbcon
DOUBLE COMPLEX for zgbcon.
Arrays: ab(ldab,*), work(*).
The array ab contains the factored band matrix A, as returned by ?gbtrf. The second dimension of ab must be at least max(1,n).
The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
anorm REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cgbcon
DOUBLE PRECISION for zgbcon.
Workspace array, DIMENSION at least max(1, 2*n).
Output Parameters
rcond REAL for single precision flavors.
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info=0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbcon interface are as follows:
ab Holds the array A of size (2*kl+ku+1,n).
ipiv Holds the vector of length n.
norm Must be '1', 'O', or 'I'. The default value is '1'.
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-2*kl-1.
Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b or A^H*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n(ku + 2kl) floating-point operations for real flavors and 8n(ku + 2kl) for complex flavors.
?gtcon
Estimates the reciprocal of the condition number of a tridiagonal matrix using the factorization computed by ?gttrf.
Syntax
Fortran 77:
call sgtcon( norm, n, dl, d, du, du2, ipiv, anorm, rcond, work, iwork, info )
call dgtcon( norm, n, dl, d, du, du2, ipiv, anorm, rcond, work, iwork, info )
call cgtcon( norm, n, dl, d, du, du2, ipiv, anorm, rcond, work, info )
call zgtcon( norm, n, dl, d, du, du2, ipiv, anorm, rcond, work, info )
Fortran 95:
call gtcon( dl, d, du, du2, ipiv, anorm, rcond [,norm] [,info] )
C:
lapack_int LAPACKE_sgtcon( char norm, lapack_int n, const float* dl, const float* d, const float* du, const float* du2, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_dgtcon( char norm, lapack_int n, const double* dl, const double* d, const double* du, const double* du2, const lapack_int* ipiv, double anorm, double* rcond );
lapack_int LAPACKE_cgtcon( char norm, lapack_int n, const lapack_complex_float* dl, const lapack_complex_float* d, const lapack_complex_float* du, const lapack_complex_float* du2, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int
LAPACKE_zgtcon( char norm, lapack_int n, const lapack_complex_double* dl, const lapack_complex_double* d, const lapack_complex_double* du, const lapack_complex_double* du2, const lapack_int* ipiv, double anorm, double* rcond );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine estimates the reciprocal of the condition number of a real or complex tridiagonal matrix A in the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty$
An estimate is obtained for $\|A^{-1}\|$, and the reciprocal of the condition number is computed as rcond = $1 / (\|A\|\,\|A^{-1}\|)$.
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?gttrf to compute the LU factorization of A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
n INTEGER. The order of the matrix A; n ≥ 0.
dl,d,du,du2 REAL for sgtcon
DOUBLE PRECISION for dgtcon
COMPLEX for cgtcon
DOUBLE COMPLEX for zgtcon.
Arrays: dl(n -1), d(n), du(n -1), du2(n -2).
The array dl contains the (n - 1) multipliers that define the matrix L from the LU factorization of A as computed by ?gttrf.
The array d contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
The array du contains the (n - 1) elements of the first superdiagonal of U.
The array du2 contains the (n - 2) elements of the second superdiagonal of U.
ipiv INTEGER. Array, DIMENSION (n).
The array of pivot indices, as returned by ?gttrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
work REAL for sgtcon, DOUBLE PRECISION for dgtcon, COMPLEX for cgtcon, DOUBLE COMPLEX for zgtcon. Workspace array, DIMENSION (2*n).
iwork INTEGER. Workspace array, DIMENSION (n). Used for real flavors only.

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gtcon interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
du2 Holds the vector of length (n-2).
ipiv Holds the vector of length n.
norm Must be '1', 'O', or 'I'. The default value is '1'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors and $8n^2$ for complex flavors.

?pocon
Estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite matrix.
Syntax
Fortran 77:
call spocon( uplo, n, a, lda, anorm, rcond, work, iwork, info )
call dpocon( uplo, n, a, lda, anorm, rcond, work, iwork, info )
call cpocon( uplo, n, a, lda, anorm, rcond, work, rwork, info )
call zpocon( uplo, n, a, lda, anorm, rcond, work, rwork, info )
Fortran 95:
call pocon( a, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_spocon( int matrix_order, char uplo, lapack_int n, const float* a, lapack_int lda, float anorm, float* rcond );
lapack_int LAPACKE_dpocon( int matrix_order, char uplo, lapack_int n, const double* a, lapack_int lda, double anorm, double* rcond );
lapack_int LAPACKE_cpocon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, float anorm, float* rcond );
lapack_int LAPACKE_zpocon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite matrix A:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is symmetric or Hermitian, $\kappa_\infty(A) = \kappa_1(A)$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?potrf to compute the Cholesky factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for spocon, DOUBLE PRECISION for dpocon, COMPLEX for cpocon, DOUBLE COMPLEX for zpocon. Arrays: a(lda,*), work(*). The array a contains the factored matrix A, as returned by ?potrf. The second dimension of a must be at least max(1,n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cpocon, DOUBLE PRECISION for zpocon. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine pocon interface are as follows:
a Holds the matrix A of size (n, n).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r.
A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors and $8n^2$ for complex flavors.

?ppcon
Estimates the reciprocal of the condition number of a packed symmetric (Hermitian) positive-definite matrix.

Syntax
Fortran 77:
call sppcon( uplo, n, ap, anorm, rcond, work, iwork, info )
call dppcon( uplo, n, ap, anorm, rcond, work, iwork, info )
call cppcon( uplo, n, ap, anorm, rcond, work, rwork, info )
call zppcon( uplo, n, ap, anorm, rcond, work, rwork, info )
Fortran 95:
call ppcon( ap, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_sppcon( int matrix_order, char uplo, lapack_int n, const float* ap, float anorm, float* rcond );
lapack_int LAPACKE_dppcon( int matrix_order, char uplo, lapack_int n, const double* ap, double anorm, double* rcond );
lapack_int LAPACKE_cppcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* ap, float anorm, float* rcond );
lapack_int LAPACKE_zppcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* ap, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a packed symmetric (Hermitian) positive-definite matrix A:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is symmetric or Hermitian, $\kappa_\infty(A) = \kappa_1(A)$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?pptrf to compute the Cholesky factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
ap, work REAL for sppcon, DOUBLE PRECISION for dppcon, COMPLEX for cppcon, DOUBLE COMPLEX for zppcon. Arrays: ap(*), work(*). The array ap contains the packed factored matrix A, as returned by ?pptrf. The dimension of ap must be at least max(1,n(n+1)/2). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cppcon, DOUBLE PRECISION for zppcon. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ppcon interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
uplo Must be 'U' or 'L'. The default value is 'U'.
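
The anorm argument must describe the original matrix, so it has to be computed before ?pptrf overwrites ap with the Cholesky factor. As a minimal sketch of that step for a real symmetric matrix in column-major packed storage, the helper below returns the 1-norm (the largest absolute column sum). The name sym_packed_norm1 is hypothetical and is not part of Intel MKL. The trailing comment only indicates how the result might be passed on; it assumes column-major layout and is not compiled as part of the sketch.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): 1-norm, i.e. the largest
   absolute column sum, of a real symmetric n-by-n matrix stored in
   column-major packed form. uplo selects which triangle ap holds. */
double sym_packed_norm1(char uplo, int n, const double *ap)
{
    assert(uplo == 'U' || uplo == 'L');
    assert(n >= 0);

    double norm = 0.0;
    for (int j = 0; j < n; ++j) {
        double colsum = 0.0;
        for (int i = 0; i < n; ++i) {
            /* By symmetry a(i,j) = a(j,i); address the stored triangle. */
            size_t lo = (size_t)(i < j ? i : j);
            size_t hi = (size_t)(i < j ? j : i);
            size_t k = (uplo == 'U')
                ? lo + hi * (hi + 1) / 2                   /* row lo, column hi */
                : hi + lo * (2 * (size_t)n - lo - 1) / 2;  /* row hi, column lo */
            colsum += fabs(ap[k]);
        }
        if (colsum > norm)
            norm = colsum;
    }
    return norm;
}

/* Possible use with the C interface (requires mkl_lapacke.h):
     double rcond;
     double anorm = sym_packed_norm1('U', n, ap);    // before factoring
     LAPACKE_dpptrf(LAPACK_COL_MAJOR, 'U', n, ap);   // overwrites ap
     LAPACKE_dppcon(LAPACK_COL_MAJOR, 'U', n, ap, anorm, &rcond);
*/
```

Because the matrix is symmetric, the 1-norm and the infinity-norm coincide, so the same value serves either definition of anorm given in the Description.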

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors and $8n^2$ for complex flavors.

?pbcon
Estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite band matrix.

Syntax
Fortran 77:
call spbcon( uplo, n, kd, ab, ldab, anorm, rcond, work, iwork, info )
call dpbcon( uplo, n, kd, ab, ldab, anorm, rcond, work, iwork, info )
call cpbcon( uplo, n, kd, ab, ldab, anorm, rcond, work, rwork, info )
call zpbcon( uplo, n, kd, ab, ldab, anorm, rcond, work, rwork, info )
Fortran 95:
call pbcon( ab, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_spbcon( int matrix_order, char uplo, lapack_int n, lapack_int kd, const float* ab, lapack_int ldab, float anorm, float* rcond );
lapack_int LAPACKE_dpbcon( int matrix_order, char uplo, lapack_int n, lapack_int kd, const double* ab, lapack_int ldab, double anorm, double* rcond );
lapack_int LAPACKE_cpbcon( int matrix_order, char uplo, lapack_int n, lapack_int kd, const lapack_complex_float* ab, lapack_int ldab, float anorm, float* rcond );
lapack_int LAPACKE_zpbcon( int matrix_order, char uplo, lapack_int n, lapack_int kd, const lapack_complex_double* ab, lapack_int ldab, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite band matrix A:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is symmetric or Hermitian, $\kappa_\infty(A) = \kappa_1(A)$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?pbtrf to compute the Cholesky factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangular factor is stored in ab. If uplo = 'L', the lower triangular factor is stored in ab.
n INTEGER. The order of the matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
ldab INTEGER. The leading dimension of the array ab (ldab ≥ kd + 1).
ab, work REAL for spbcon, DOUBLE PRECISION for dpbcon, COMPLEX for cpbcon, DOUBLE COMPLEX for zpbcon. Arrays: ab(ldab,*), work(*). The array ab contains the factored matrix A in band form, as returned by ?pbtrf. The second dimension of ab must be at least max(1, n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cpbcon, DOUBLE PRECISION for zpbcon. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision).
However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine pbcon interface are as follows:
ab Holds the array A of size (kd+1,n).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 4*n(kd + 1) floating-point operations for real flavors and 16*n(kd + 1) for complex flavors.

?ptcon
Estimates the reciprocal of the condition number of a symmetric (Hermitian) positive-definite tridiagonal matrix.
Syntax
Fortran 77:
call sptcon( n, d, e, anorm, rcond, work, info )
call dptcon( n, d, e, anorm, rcond, work, info )
call cptcon( n, d, e, anorm, rcond, work, info )
call zptcon( n, d, e, anorm, rcond, work, info )
Fortran 95:
call ptcon( d, e, anorm, rcond [,info] )
C:
lapack_int LAPACKE_sptcon( lapack_int n, const float* d, const float* e, float anorm, float* rcond );
lapack_int LAPACKE_dptcon( lapack_int n, const double* d, const double* e, double anorm, double* rcond );
lapack_int LAPACKE_cptcon( lapack_int n, const float* d, const lapack_complex_float* e, float anorm, float* rcond );
lapack_int LAPACKE_zptcon( lapack_int n, const double* d, const lapack_complex_double* e, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the reciprocal of the condition number (in the 1-norm) of a real symmetric or complex Hermitian positive-definite tridiagonal matrix using the factorization A = L*D*L^T for real flavors and A = L*D*L^H for complex flavors, or A = U^T*D*U for real flavors and A = U^H*D*U for complex flavors, computed by ?pttrf:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is symmetric or Hermitian, $\kappa_\infty(A) = \kappa_1(A)$).
The norm $\|A^{-1}\|$ is computed by a direct method, and the reciprocal of the condition number is computed as $\mathrm{rcond} = 1 / (\|A\| \, \|A^{-1}\|)$.
Before calling this routine:
• compute anorm as $\|A\|_1 = \max_j \sum_i |a_{ij}|$
• call ?pttrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
d, work REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. Arrays, dimension (n).
The array d contains the n diagonal elements of the diagonal matrix D from the factorization of A, as computed by ?pttrf; work is a workspace array.
e REAL for sptcon, DOUBLE PRECISION for dptcon, COMPLEX for cptcon, DOUBLE COMPLEX for zptcon. Array, DIMENSION (n-1). Contains off-diagonal elements of the unit bidiagonal factor U or L from the factorization computed by ?pttrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The 1-norm of the original matrix A (see Description).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ptcon interface are as follows:
d Holds the vector of length n.
e Holds the vector of length (n-1).

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 4*n(kd + 1) floating-point operations for real flavors and 16*n(kd + 1) for complex flavors.

?sycon
Estimates the reciprocal of the condition number of a symmetric matrix.
Syntax
Fortran 77:
call ssycon( uplo, n, a, lda, ipiv, anorm, rcond, work, iwork, info )
call dsycon( uplo, n, a, lda, ipiv, anorm, rcond, work, iwork, info )
call csycon( uplo, n, a, lda, ipiv, anorm, rcond, work, info )
call zsycon( uplo, n, a, lda, ipiv, anorm, rcond, work, info )
Fortran 95:
call sycon( a, ipiv, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_ssycon( int matrix_order, char uplo, lapack_int n, const float* a, lapack_int lda, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_dsycon( int matrix_order, char uplo, lapack_int n, const double* a, lapack_int lda, const lapack_int* ipiv, double anorm, double* rcond );
lapack_int LAPACKE_csycon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zsycon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a symmetric matrix A:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is symmetric, $\kappa_\infty(A) = \kappa_1(A)$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?sytrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = P*U*D*U^T*P^T.
If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
a, work REAL for ssycon, DOUBLE PRECISION for dsycon, COMPLEX for csycon, DOUBLE COMPLEX for zsycon. Arrays: a(lda,*), work(*). The array a contains the factored matrix A, as returned by ?sytrf. The second dimension of a must be at least max(1,n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 2*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv, as returned by ?sytrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sycon interface are as follows:
a Holds the matrix A of size (n, n).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r.
A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors and $8n^2$ for complex flavors.

?syconv
Converts a symmetric matrix given by a triangular matrix factorization into two matrices and vice versa.

Syntax
Fortran 77:
call ssyconv( uplo, way, n, a, lda, ipiv, work, info )
call dsyconv( uplo, way, n, a, lda, ipiv, work, info )
call csyconv( uplo, way, n, a, lda, ipiv, work, info )
call zsyconv( uplo, way, n, a, lda, ipiv, work, info )
Fortran 95:
call syconv( a[,uplo][,way][,ipiv][,info] )

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine converts matrix A, which results from a triangular matrix factorization, into matrices L and D and vice versa. The routine returns the off-diagonal elements of D in the workspace and applies or reverses the permutation done with the triangular matrix factorization.

Input Parameters
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the details of the factorization are stored as an upper or lower triangular matrix: If uplo = 'U': the upper triangular, A = U*D*U^T. If uplo = 'L': the lower triangular, A = L*D*L^T.
way CHARACTER*1. Must be 'C' or 'R'. Indicates whether the routine converts or reverts the matrix: way = 'C' means conversion. way = 'R' means reversion.
n INTEGER. The order of matrix A; n ≥ 0.
a REAL for ssyconv, DOUBLE PRECISION for dsyconv, COMPLEX for csyconv, DOUBLE COMPLEX for zsyconv. Array of DIMENSION (lda,n). The block diagonal matrix D and the multipliers used to obtain the factor U or L as computed by ?sytrf.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the block structure of D, as returned by ?sytrf.
work INTEGER.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine syconv interface are as follows:
a Holds the matrix A of size (n, n).
uplo Must be 'U' or 'L'.
way Must be 'C' or 'R'.
ipiv Holds the vector of length n.

See Also
?sytrf

?hecon
Estimates the reciprocal of the condition number of a Hermitian matrix.

Syntax
Fortran 77:
call checon( uplo, n, a, lda, ipiv, anorm, rcond, work, info )
call zhecon( uplo, n, a, lda, ipiv, anorm, rcond, work, info )
Fortran 95:
call hecon( a, ipiv, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_checon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zhecon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a Hermitian matrix A:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is Hermitian, $\kappa_\infty(A) = \kappa_1(A)$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?hetrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the upper triangular factor U of the factorization A = P*U*D*U^H*P^T. If uplo = 'L', the array a stores the lower triangular factor L of the factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
a, work COMPLEX for checon, DOUBLE COMPLEX for zhecon. Arrays: a(lda,*), work(*). The array a contains the factored matrix A, as returned by ?hetrf. The second dimension of a must be at least max(1,n). The array work is a workspace for the routine. The dimension of work must be at least max(1, 2*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv, as returned by ?hetrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hecon interface are as follows:
a Holds the matrix A of size (n, n).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than r (the reciprocal of the true condition number) and in practice is nearly always less than 10r. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 5 and never more than 11. Each solution requires approximately $8n^2$ floating-point operations.

?spcon
Estimates the reciprocal of the condition number of a packed symmetric matrix.

Syntax
Fortran 77:
call sspcon( uplo, n, ap, ipiv, anorm, rcond, work, iwork, info )
call dspcon( uplo, n, ap, ipiv, anorm, rcond, work, iwork, info )
call cspcon( uplo, n, ap, ipiv, anorm, rcond, work, info )
call zspcon( uplo, n, ap, ipiv, anorm, rcond, work, info )
Fortran 95:
call spcon( ap, ipiv, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_sspcon( int matrix_order, char uplo, lapack_int n, const float* ap, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_dspcon( int matrix_order, char uplo, lapack_int n, const double* ap, const lapack_int* ipiv, double anorm, double* rcond );
lapack_int LAPACKE_cspcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* ap, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zspcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* ap, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a packed symmetric matrix A:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is symmetric, $\kappa_\infty(A) = \kappa_1(A)$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?sptrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array ap stores the packed upper triangular factor U of the factorization A = P*U*D*U^T*P^T. If uplo = 'L', the array ap stores the packed lower triangular factor L of the factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of matrix A; n ≥ 0.
ap, work REAL for sspcon, DOUBLE PRECISION for dspcon, COMPLEX for cspcon, DOUBLE COMPLEX for zspcon. Arrays: ap(*), work(*). The array ap contains the packed factored matrix A, as returned by ?sptrf. The dimension of ap must be at least max(1,n(n+1)/2). The array work is a workspace for the routine. The dimension of work must be at least max(1, 2*n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv, as returned by ?sptrf.
anorm REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, anytime rcond is small compared to 1.0, for the working precision, the matrix may be poorly conditioned or even singular.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine spcon interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
ipiv  Holds the vector of length n.
uplo  Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed rcond is never less than $\rho$ (the reciprocal of the true condition number) and in practice is nearly always less than $10\rho$. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors and $8n^2$ for complex flavors.

?hpcon
Estimates the reciprocal of the condition number of a packed Hermitian matrix.

Syntax
Fortran 77:
call chpcon( uplo, n, ap, ipiv, anorm, rcond, work, info )
call zhpcon( uplo, n, ap, ipiv, anorm, rcond, work, info )
Fortran 95:
call hpcon( ap, ipiv, anorm, rcond [,uplo] [,info] )
C:
lapack_int LAPACKE_chpcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* ap, const lapack_int* ipiv, float anorm, float* rcond );
lapack_int LAPACKE_zhpcon( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* ap, const lapack_int* ipiv, double anorm, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a Hermitian matrix A:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ (since A is Hermitian, $\kappa_\infty(A) = \kappa_1(A)$).
Before calling this routine:
• compute anorm (either $\|A\|_1 = \max_j \sum_i |a_{ij}|$ or $\|A\|_\infty = \max_i \sum_j |a_{ij}|$)
• call ?hptrf to compute the factorization of A.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored:
If uplo = 'U', the array ap stores the packed upper triangular factor U of the factorization $A = P U D U^H P^T$.
If uplo = 'L', the array ap stores the packed lower triangular factor L of the factorization $A = P L D L^H P^T$.
n  INTEGER. The order of matrix A; n ≥ 0.
ap, work  COMPLEX for chpcon
DOUBLE COMPLEX for zhpcon.
Arrays: ap(*), work(*).
The array ap contains the packed factored matrix A, as returned by ?hptrf. The dimension of ap must be at least max(1, n(n+1)/2).
The array work is a workspace for the routine. The dimension of work must be at least max(1, 2*n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). The array ipiv, as returned by ?hptrf.
anorm  REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. The norm of the original matrix A (see Description).

Output Parameters
rcond  REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, whenever rcond is small compared to 1.0 for the working precision, the matrix may be poorly conditioned or even singular.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hpcon interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
ipiv  Holds the vector of length n.
Application Notes
The computed rcond is never less than $\rho$ (the reciprocal of the true condition number) and in practice is nearly always less than $10\rho$. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 5 and never more than 11. Each solution requires approximately $8n^2$ floating-point operations.

?trcon
Estimates the reciprocal of the condition number of a triangular matrix.

Syntax
Fortran 77:
call strcon( norm, uplo, diag, n, a, lda, rcond, work, iwork, info )
call dtrcon( norm, uplo, diag, n, a, lda, rcond, work, iwork, info )
call ctrcon( norm, uplo, diag, n, a, lda, rcond, work, rwork, info )
call ztrcon( norm, uplo, diag, n, a, lda, rcond, work, rwork, info )
Fortran 95:
call trcon( a, rcond [,uplo] [,diag] [,norm] [,info] )
C:
lapack_int LAPACKE_strcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const float* a, lapack_int lda, float* rcond );
lapack_int LAPACKE_dtrcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const double* a, lapack_int lda, double* rcond );
lapack_int LAPACKE_ctrcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* rcond );
lapack_int LAPACKE_ztrcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a triangular matrix A in either the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm  CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular:
If uplo = 'U', the array a stores the upper triangle of A, other array elements are not referenced.
If uplo = 'L', the array a stores the lower triangle of A, other array elements are not referenced.
diag  CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array a.
n  INTEGER. The order of the matrix A; n ≥ 0.
a, work  REAL for strcon
DOUBLE PRECISION for dtrcon
COMPLEX for ctrcon
DOUBLE COMPLEX for ztrcon.
Arrays: a(lda,*), work(*).
The array a contains the matrix A. The second dimension of a must be at least max(1, n).
The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda  INTEGER. The leading dimension of a; lda ≥ max(1, n).
iwork  INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork  REAL for ctrcon
DOUBLE PRECISION for ztrcon.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond  REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, whenever rcond is small compared to 1.0 for the working precision, the matrix may be poorly conditioned or even singular.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
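For a small matrix, the quantity that ?trcon estimates can be computed directly, which may help when checking results. The following C sketch is not the library's algorithm (the routine estimates the norm of the inverse without forming it); it inverts an upper triangular, non-unit matrix column by column with back substitution and returns $1/(\|A\|_1 \|A^{-1}\|_1)$. The function name exact_rcond1_upper, its return conventions, and the 0-based column-major indexing a[i + j*lda] are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Absolute value without relying on the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper: exact reciprocal 1-norm condition number of an
 * n-by-n upper triangular, non-unit matrix stored column-major in a
 * (element (i,j) at a[i + j*lda], 0-based). Costs O(n^3), so it is meant
 * only for checking small cases. Returns 1.0 for n <= 0, 0.0 if a diagonal
 * entry is zero (singular matrix), and -1.0 if allocation fails. */
double exact_rcond1_upper(int n, const double *a, int lda)
{
    if (n <= 0) return 1.0;
    assert(a != NULL && lda >= n);

    double *x = malloc((size_t)n * sizeof *x);
    if (!x) return -1.0;

    double anorm = 0.0, ainvnorm = 0.0;
    for (int j = 0; j < n; ++j) {
        /* 1-norm of column j of A: only rows 0..j are referenced. */
        double s = 0.0;
        for (int i = 0; i <= j; ++i) s += abs_d(a[i + j * lda]);
        if (s > anorm) anorm = s;

        if (a[j + j * lda] == 0.0) { free(x); return 0.0; }

        /* Column j of inv(A): solve A*x = e_j by back substitution.
         * Rows below j are zero because inv(A) is upper triangular. */
        double t = 0.0;
        for (int i = j; i >= 0; --i) {
            double r = (i == j) ? 1.0 : 0.0;
            for (int k = i + 1; k <= j; ++k) r -= a[i + k * lda] * x[k];
            x[i] = r / a[i + i * lda];
            t += abs_d(x[i]);
        }
        if (t > ainvnorm) ainvnorm = t;
    }
    free(x);
    return 1.0 / (anorm * ainvnorm);
}
```

For the 2-by-2 matrix with rows (1, 2) and (0, 4), the 1-norm of A is 6 and the 1-norm of its inverse is 1, so the helper returns 1/6. An estimate returned by the library routine for the same matrix should be of comparable size.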
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine trcon interface are as follows:
a  Holds the matrix A of size (n, n).
norm  Must be '1', 'O', or 'I'. The default value is '1'.
uplo  Must be 'U' or 'L'. The default value is 'U'.
diag  Must be 'N' or 'U'. The default value is 'N'.

Application Notes
The computed rcond is never less than $\rho$ (the reciprocal of the true condition number) and in practice is nearly always less than $10\rho$. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $n^2$ floating-point operations for real flavors and $4n^2$ operations for complex flavors.

?tpcon
Estimates the reciprocal of the condition number of a packed triangular matrix.
Syntax
Fortran 77:
call stpcon( norm, uplo, diag, n, ap, rcond, work, iwork, info )
call dtpcon( norm, uplo, diag, n, ap, rcond, work, iwork, info )
call ctpcon( norm, uplo, diag, n, ap, rcond, work, rwork, info )
call ztpcon( norm, uplo, diag, n, ap, rcond, work, rwork, info )
Fortran 95:
call tpcon( ap, rcond [,uplo] [,diag] [,norm] [,info] )
C:
lapack_int LAPACKE_stpcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const float* ap, float* rcond );
lapack_int LAPACKE_dtpcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const double* ap, double* rcond );
lapack_int LAPACKE_ctpcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const lapack_complex_float* ap, float* rcond );
lapack_int LAPACKE_ztpcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, const lapack_complex_double* ap, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a packed triangular matrix A in either the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm  CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular:
If uplo = 'U', the array ap stores the upper triangle of A in packed form.
If uplo = 'L', the array ap stores the lower triangle of A in packed form.
diag  CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array ap.
n  INTEGER. The order of the matrix A; n ≥ 0.
ap, work  REAL for stpcon
DOUBLE PRECISION for dtpcon
COMPLEX for ctpcon
DOUBLE COMPLEX for ztpcon.
Arrays: ap(*), work(*).
The array ap contains the packed matrix A. The dimension of ap must be at least max(1, n(n+1)/2).
The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
iwork  INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork  REAL for ctpcon
DOUBLE PRECISION for ztpcon.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond  REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision). However, whenever rcond is small compared to 1.0 for the working precision, the matrix may be poorly conditioned or even singular.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine tpcon interface are as follows:
ap  Holds the array A of size (n*(n+1)/2).
norm  Must be '1', 'O', or 'I'. The default value is '1'.
uplo  Must be 'U' or 'L'. The default value is 'U'.
diag  Must be 'N' or 'U'. The default value is 'N'.
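The packed array ap of size n(n+1)/2 holds the stored triangle column by column. The C sketch below shows one way to map a matrix element to its position in ap and to fill ap from a full column-major matrix. It assumes the usual LAPACK packed storage convention, which this section does not spell out, and uses 0-based indices; the names packed_index and pack_triangle are hypothetical helpers, not library functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 0-based position in ap of element (i, j) of an
 * n-by-n triangular matrix held in packed storage, where the stored
 * triangle is laid out column by column (assumed LAPACK convention).
 *   uplo = 'U': requires i <= j
 *   uplo = 'L': requires i >= j */
size_t packed_index(char uplo, int n, int i, int j)
{
    assert(n > 0 && i >= 0 && j >= 0 && i < n && j < n);
    if (uplo == 'U') {
        assert(i <= j);
        /* Columns 0..j-1 of the upper triangle hold j*(j+1)/2 elements. */
        return (size_t)i + (size_t)j * (size_t)(j + 1) / 2;
    }
    assert(uplo == 'L' && i >= j);
    /* Column j of the lower triangle starts at j*(2n - j + 1)/2; adding
     * the row offset (i - j) gives i + j*(2n - j - 1)/2. */
    return (size_t)i + (size_t)j * (size_t)(2 * n - j - 1) / 2;
}

/* Copy the chosen triangle of a column-major n-by-n matrix a (leading
 * dimension lda) into ap, which must hold at least n*(n+1)/2 elements. */
void pack_triangle(char uplo, int n, const double *a, int lda, double *ap)
{
    assert(lda >= n);
    for (int j = 0; j < n; ++j) {
        int first = (uplo == 'U') ? 0 : j;      /* first stored row in column j */
        int last  = (uplo == 'U') ? j : n - 1;  /* last stored row in column j  */
        for (int i = first; i <= last; ++i)
            ap[packed_index(uplo, n, i, j)] = a[i + j * lda];
    }
}
```

With n = 3 and the full matrix stored by columns as 1, 2, ..., 9, packing with uplo = 'U' gives ap = (1, 4, 5, 7, 8, 9) and with uplo = 'L' gives ap = (1, 2, 3, 5, 6, 9).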
Application Notes
The computed rcond is never less than $\rho$ (the reciprocal of the true condition number) and in practice is nearly always less than $10\rho$. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $n^2$ floating-point operations for real flavors and $4n^2$ operations for complex flavors.

?tbcon
Estimates the reciprocal of the condition number of a triangular band matrix.

Syntax
Fortran 77:
call stbcon( norm, uplo, diag, n, kd, ab, ldab, rcond, work, iwork, info )
call dtbcon( norm, uplo, diag, n, kd, ab, ldab, rcond, work, iwork, info )
call ctbcon( norm, uplo, diag, n, kd, ab, ldab, rcond, work, rwork, info )
call ztbcon( norm, uplo, diag, n, kd, ab, ldab, rcond, work, rwork, info )
Fortran 95:
call tbcon( ab, rcond [,uplo] [,diag] [,norm] [,info] )
C:
lapack_int LAPACKE_stbcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, lapack_int kd, const float* ab, lapack_int ldab, float* rcond );
lapack_int LAPACKE_dtbcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, lapack_int kd, const double* ab, lapack_int ldab, double* rcond );
lapack_int LAPACKE_ctbcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, lapack_int kd, const lapack_complex_float* ab, lapack_int ldab, float* rcond );
lapack_int LAPACKE_ztbcon( int matrix_order, char norm, char uplo, char diag, lapack_int n, lapack_int kd, const lapack_complex_double* ab, lapack_int ldab, double* rcond );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the reciprocal of the condition number of a triangular band matrix A in either the 1-norm or infinity-norm:
$\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 = \kappa_\infty(A^T) = \kappa_\infty(A^H)$
$\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T) = \kappa_1(A^H)$.

Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
norm  CHARACTER*1. Must be '1' or 'O' or 'I'.
If norm = '1' or 'O', then the routine estimates the condition number of matrix A in 1-norm.
If norm = 'I', then the routine estimates the condition number of matrix A in infinity-norm.
uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular:
If uplo = 'U', the array ab stores the upper triangle of A in band storage.
If uplo = 'L', the array ab stores the lower triangle of A in band storage.
diag  CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', then A is unit triangular: diagonal elements are assumed to be 1 and not referenced in the array ab.
n  INTEGER. The order of the matrix A; n ≥ 0.
kd  INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
ab, work  REAL for stbcon
DOUBLE PRECISION for dtbcon
COMPLEX for ctbcon
DOUBLE COMPLEX for ztbcon.
Arrays: ab(ldab,*), work(*).
The array ab contains the band matrix A. The second dimension of ab must be at least max(1, n).
The array work is a workspace for the routine. The dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldab  INTEGER. The leading dimension of the array ab; ldab ≥ kd + 1.
iwork  INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork  REAL for ctbcon
DOUBLE PRECISION for ztbcon.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
rcond  REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal of the condition number. The routine sets rcond = 0 if the estimate underflows; in this case the matrix is singular (to working precision).
However, whenever rcond is small compared to 1.0 for the working precision, the matrix may be poorly conditioned or even singular.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine tbcon interface are as follows:
ab  Holds the array A of size (kd+1, n).
norm  Must be '1', 'O', or 'I'. The default value is '1'.
uplo  Must be 'U' or 'L'. The default value is 'U'.
diag  Must be 'N' or 'U'. The default value is 'N'.

Application Notes
The computed rcond is never less than $\rho$ (the reciprocal of the true condition number) and in practice is nearly always less than $10\rho$. A call to this routine involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2*n(kd + 1) floating-point operations for real flavors and 8*n(kd + 1) operations for complex flavors.

Refining the Solution and Estimating Its Error
This section describes the LAPACK routines for refining the computed solution of a system of linear equations and estimating the solution error. You can call these routines after factorizing the matrix of the system of equations and computing the solution (see Routines for Matrix Factorization and Routines for Solving Systems of Linear Equations).

?gerfs
Refines the solution of a system of linear equations with a general matrix and estimates its error.
Syntax
Fortran 77:
call sgerfs( trans, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dgerfs( trans, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cgerfs( trans, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zgerfs( trans, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call gerfs( a, af, ipiv, b, x [,trans] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_sgerfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dgerfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cgerfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zgerfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations $A X = B$ or $A^T X = B$ or $A^H X = B$ with a general matrix A, with
multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error $\beta$. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
$|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine:
• call the factorization routine ?getrf
• call the solver routine ?getrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans  CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations:
If trans = 'N', the system has the form $A X = B$.
If trans = 'T', the system has the form $A^T X = B$.
If trans = 'C', the system has the form $A^H X = B$.
n  INTEGER. The order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, af, b, x, work  REAL for sgerfs
DOUBLE PRECISION for dgerfs
COMPLEX for cgerfs
DOUBLE COMPLEX for zgerfs.
Arrays:
a(lda,*) contains the original matrix A, as supplied to ?getrf.
af(ldaf,*) contains the factored matrix A, as returned by ?getrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The second dimension of a and af must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda  INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf  INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx  INTEGER.
The leading dimension of x; ldx ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?getrf.
iwork  INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork  REAL for cgerfs
DOUBLE PRECISION for zgerfs.
Workspace array, DIMENSION at least max(1, n).

Output Parameters
x  The refined solution matrix X.
ferr, berr  REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the component-wise forward and backward errors, respectively, for each solution vector.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gerfs interface are as follows:
a  Holds the matrix A of size (n, n).
af  Holds the matrix AF of size (n, n).
ipiv  Holds the vector of length n.
b  Holds the matrix B of size (n, nrhs).
x  Holds the matrix X of size (n, nrhs).
ferr  Holds the vector of length (nrhs).
berr  Holds the vector of length (nrhs).
trans  Must be 'N', 'C', or 'T'. The default value is 'N'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of $4n^2$ floating-point operations (for real flavors) or $16n^2$ operations (for complex flavors). In addition, each step of iterative refinement involves $6n^2$ operations (for real flavors) or $24n^2$ operations (for complex flavors); the number of iterations may range from 1 to 5.
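The component-wise backward error described for this routine has a well-known closed form: with the residual r = b - A*x, it equals $\max_i |r_i| / (|A|\,|x| + |b|)_i$. The C sketch below evaluates that expression for a real matrix, one right-hand side, and the untransposed system. Forming the residual and the denominator takes roughly $4n^2$ operations, which matches the minimum cost quoted above. The function name componentwise_berr is hypothetical, and the treatment of zero denominators is a simplification of this sketch; the text does not say how ?gerfs handles them.

```c
#include <assert.h>

/* Absolute value without relying on the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper: component-wise relative backward error of a
 * computed solution x of A*x = b (trans = 'N', one right-hand side).
 * a is n-by-n, column-major with leading dimension lda (0-based).
 * Rows where (|A|*|x| + |b|)_i == 0 are skipped; this is a simplification
 * of the sketch, not a statement about the library routine. */
double componentwise_berr(int n, const double *a, int lda,
                          const double *b, const double *x)
{
    assert(n >= 0 && lda >= n);
    double berr = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = b[i];          /* accumulates (b - A*x)_i         */
        double d = abs_d(b[i]);   /* accumulates (|A|*|x| + |b|)_i   */
        for (int j = 0; j < n; ++j) {
            double aij = a[i + j * lda];
            r -= aij * x[j];
            d += abs_d(aij) * abs_d(x[j]);
        }
        if (d > 0.0) {
            double q = abs_d(r) / d;
            if (q > berr) berr = q;
        }
    }
    return berr;
}
```

For the 2-by-2 identity matrix with b = (1, 1), the vector x = (1, 1) gives a backward error of 0, while x = (1.1, 1) gives about 0.1/2.1, roughly 0.048.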
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately $2n^2$ floating-point operations for real flavors or $8n^2$ for complex flavors.

?gerfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a general matrix A and provides error bounds and backward error estimates.

Syntax
Fortran 77:
call sgerfsx( trans, equed, n, nrhs, a, lda, af, ldaf, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dgerfsx( trans, equed, n, nrhs, a, lda, af, ldaf, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cgerfsx( trans, equed, n, nrhs, a, lda, af, ldaf, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zgerfsx( trans, equed, n, nrhs, a, lda, af, ldaf, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sgerfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const lapack_int* ipiv, const float* r, const float* c, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_dgerfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const lapack_int* ipiv, const double* r, const double* c, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params
);
lapack_int LAPACKE_cgerfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const float* r, const float* c, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zgerfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const double* r, const double* c, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine improves the computed solution to a system of linear equations and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed, r, and c below. In this case, the solution and error bounds returned are for the original unequilibrated system.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans  CHARACTER*1.
Must be 'N', 'T', or 'C'. Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A**T*X = B (Transpose).
If trans = 'C', the system has the form A**H*X = B (Conjugate transpose; for real flavors this is the same as Transpose).
equed  CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. Specifies the form of equilibration that was done to A before calling this routine.
If equed = 'N', no equilibration was done.
If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r).
If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c).
If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c). The right-hand side B has been changed accordingly.
n  INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs  INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work  REAL for sgerfsx
DOUBLE PRECISION for dgerfsx
COMPLEX for cgerfsx
DOUBLE COMPLEX for zgerfsx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the original n-by-n matrix A.
The array af contains the factored form of the matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by ?getrf.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1, 4*n) for real flavors, and at least max(1, 2*n) for complex flavors.
lda  INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf  INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ipiv  INTEGER. Array, DIMENSION at least max(1, n). Contains the pivot indices as computed by ?getrf; for 1 ≤ i ≤ n, row i of the matrix was interchanged with row ipiv(i).
r, c  REAL for single precision flavors. DOUBLE PRECISION for double precision flavors.
Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A.
If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If equed = 'R' or 'B', each element of r must be positive.
If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If equed = 'C' or 'B', each element of c must be positive.
Each element of r or c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb  INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
x  REAL for sgerfsx
DOUBLE PRECISION for dgerfsx
COMPLEX for cgerfsx
DOUBLE COMPLEX for zgerfsx.
Array, DIMENSION (ldx,*). The solution matrix X as computed by ?getrs.
ldx  INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds  INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See the err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
nparams  INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params  REAL for single precision flavors. DOUBLE PRECISION for double precision flavors.
Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters.
If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.

params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0.
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)

params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.

params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).

iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.

rwork REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters

x REAL for sgerfsx, DOUBLE PRECISION for dgerfsx, COMPLEX for cgerfsx, DOUBLE COMPLEX for zgerfsx. The improved solution matrix X.

rcond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
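The defaulting rules for the input array params (entries below 0.0 and positions beyond nparams fall back to defaults) can be sketched in C as follows. This is only an illustration of the documented rules, not the library's implementation: resolve_params and the enum names are hypothetical, the indices are 0-based unlike the 1-based la_linrx_*_i positions, and any additional validation the routine may perform is not modeled.

```c
#include <assert.h>

/* Hypothetical 0-based positions matching params(1), params(2), params(3). */
enum { P_ITREF = 0, P_ITHRESH = 1, P_CWISE = 2, P_COUNT = 3 };

/* Hypothetical helper: compute the effective parameter values.
   - positions k >= nparams are never read and take the default;
   - an entry less than 0.0 is replaced by the default. */
void resolve_params(int nparams, const double *params, double effective[P_COUNT])
{
    /* Documented defaults: refinement on, 10 residual computations,
       componentwise convergence attempted. */
    static const double defaults[P_COUNT] = { 1.0, 10.0, 1.0 };

    assert(nparams >= 0);
    for (int k = 0; k < P_COUNT; ++k) {
        if (k < nparams && params[k] >= 0.0)
            effective[k] = params[k];
        else
            effective[k] = defaults[k];
    }
}
```

With nparams = 0 the params pointer is never dereferenced, which mirrors the statement that the array is not accessed in that case.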
berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

err_bnds_norm REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:

Normwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / max_j |x(j,i)|

The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:

err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.

err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.

err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix z.
Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.

err_bnds_comp REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:

Componentwise relative error in the i-th solution vector: max_j ( |xtrue(j,i) - x(j,i)| / |x(j,i)| )

The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:

err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.

err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.

err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed".
These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.

params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.

info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed. If info = -i, the i-th parameter had an illegal value. If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?gbrfs
Refines the solution of a system of linear equations with a general band matrix and estimates its error.
Syntax

Fortran 77:
call sgbrfs( trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dgbrfs( trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cgbrfs( trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zgbrfs( trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )

Fortran 95:
call gbrfs( ab, afb, ipiv, b, x [,kl] [,trans] [,ferr] [,berr] [,info] )

C:
lapack_int LAPACKE_sgbrfs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const float* ab, lapack_int ldab, const float* afb, lapack_int ldafb, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dgbrfs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const double* ab, lapack_int ldab, const double* afb, lapack_int ldafb, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cgbrfs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const lapack_complex_float* ab, lapack_int ldab, const lapack_complex_float* afb, lapack_int ldafb, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zgbrfs( int matrix_order, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const lapack_complex_double* ab, lapack_int ldab, const lapack_complex_double* afb, lapack_int ldafb, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine performs an iterative refinement of the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a band matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: |δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb). Finally, the routine estimates the component-wise forward error in the computed solution ‖x - xe‖∞/‖x‖∞ (here xe is the exact solution).

Before calling this routine:
• call the factorization routine ?gbtrf
• call the solver routine ?gbtrs.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', the system has the form A*X = B. If trans = 'T', the system has the form A^T*X = B. If trans = 'C', the system has the form A^H*X = B.

n INTEGER. The order of the matrix A; n ≥ 0.

kl INTEGER. The number of sub-diagonals within the band of A; kl ≥ 0.

ku INTEGER. The number of super-diagonals within the band of A; ku ≥ 0.

nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.

ab, afb, b, x, work REAL for sgbrfs, DOUBLE PRECISION for dgbrfs, COMPLEX for cgbrfs, DOUBLE COMPLEX for zgbrfs. Arrays: ab(ldab,*) contains the original band matrix A, as supplied to ?gbtrf, but stored in rows from 1 to kl + ku + 1. afb(ldafb,*) contains the factored band matrix A, as returned by ?gbtrf. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array.
The second dimension of ab and afb must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.

ldab INTEGER. The leading dimension of ab.

ldafb INTEGER. The leading dimension of afb.

ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).

ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?gbtrf.

iwork INTEGER. Workspace array, DIMENSION at least max(1, n).

rwork REAL for cgbrfs, DOUBLE PRECISION for zgbrfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters

x The refined solution matrix X.

ferr, berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gbrfs interface are as follows:

ab Holds the array A of size (kl+ku+1,n).
afb Holds the array AF of size (2*kl+ku+1,n).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n, nrhs).
x Holds the matrix X of size (n, nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-kl-1, where lda is the leading dimension of ab.

Application Notes

The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
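As an illustration of the band layout that ab uses (rows 1 to kl + ku + 1, size (kl+ku+1, n)), the following C sketch packs a dense column-major matrix into band storage. It assumes the usual LAPACK mapping ab(ku+1+i-j, j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl); pack_band is a hypothetical helper, not an MKL routine, and it handles only the double-precision real case.

```c
#include <assert.h>

/* Hypothetical helper: copy the band of a dense column-major n-by-n
   matrix a (leading dimension lda) into band storage ab (leading
   dimension ldab >= kl+ku+1). Entries of ab outside the band are
   left untouched. */
void pack_band(int n, int kl, int ku,
               const double *a, int lda, double *ab, int ldab)
{
    assert(n >= 0 && kl >= 0 && ku >= 0);
    assert(lda >= (n > 1 ? n : 1));
    assert(ldab >= kl + ku + 1);

    for (int j = 1; j <= n; ++j) {                 /* 1-based column   */
        int ilo = (j - ku > 1) ? j - ku : 1;       /* max(1, j-ku)     */
        int ihi = (j + kl < n) ? j + kl : n;       /* min(n, j+kl)     */
        for (int i = ilo; i <= ihi; ++i) {
            /* ab(ku+1+i-j, j) = A(i, j), converted to 0-based offsets */
            ab[(ku + i - j) + (j - 1) * ldab] = a[(i - 1) + (j - 1) * lda];
        }
    }
}
```

Each diagonal of A ends up in one row of ab: the main diagonal in row ku+1, superdiagonals above it, and subdiagonals below it.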
For each right-hand side, computation of the backward error involves a minimum of 4n(kl + ku) floating-point operations (for real flavors) or 16n(kl + ku) operations (for complex flavors). In addition, each step of iterative refinement involves 2n(4kl + 3ku) operations (for real flavors) or 8n(4kl + 3ku) operations (for complex flavors); the number of iterations may range from 1 to 5. Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n^2 floating-point operations for real flavors or 8n^2 for complex flavors.

?gbrfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a banded matrix A and provides error bounds and backward error estimates.

Syntax

Fortran 77:
call sgbrfsx( trans, equed, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dgbrfsx( trans, equed, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cgbrfsx( trans, equed, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zgbrfsx( trans, equed, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, r, c, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )

C:
lapack_int LAPACKE_sgbrfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const float* ab, lapack_int ldab, const float* afb, lapack_int ldafb, const lapack_int* ipiv, const float* r, const float* c, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_dgbrfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const double* ab, lapack_int ldab, const double* afb, lapack_int ldafb, const lapack_int* ipiv, const double* r, const double* c, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );
lapack_int LAPACKE_cgbrfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const lapack_complex_float* ab, lapack_int ldab, const lapack_complex_float* afb, lapack_int ldafb, const lapack_int* ipiv, const float* r, const float* c, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zgbrfsx( int matrix_order, char trans, char equed, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, const lapack_complex_double* ab, lapack_int ldab, const lapack_complex_double* afb, lapack_int ldafb, const lapack_int* ipiv, const double* r, const double* c, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine improves the computed solution to a system of linear equations and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible.
See comments for err_bnds_norm and err_bnds_comp for details of the error bounds. The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed, r, and c below. In this case, the solution and error bounds returned are for the original unequilibrated system.

Input Parameters

The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form A*X = B (No transpose). If trans = 'T', the system has the form A**T*X = B (Transpose). If trans = 'C', the system has the form A**H*X = B (Conjugate transpose; identical to Transpose for real flavors).

equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. Specifies the form of equilibration that was done to A before calling this routine. If equed = 'N', no equilibration was done. If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r). If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c). If equed = 'B', both row and column equilibration were done, that is, A has been replaced by diag(r)*A*diag(c). The right-hand side B has been changed accordingly.

n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.

kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.

ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.

nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.

ab, afb, b, work REAL for sgbrfsx, DOUBLE PRECISION for dgbrfsx, COMPLEX for cgbrfsx, DOUBLE COMPLEX for zgbrfsx. Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), work(*). The array ab contains the original matrix A in band storage, in rows 1 to kl+ku+1.
The j-th column of A is stored in the j-th column of the array ab as follows: ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl). The array afb contains details of the LU factorization of the banded matrix A as computed by ?gbtrf. U is stored as an upper triangular banded matrix with kl + ku superdiagonals in rows 1 to kl + ku + 1. The multipliers used during the factorization are stored in rows kl + ku + 2 to 2*kl + ku + 1. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs). work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.

ldab INTEGER. The leading dimension of the array ab; ldab ≥ kl+ku+1.

ldafb INTEGER. The leading dimension of the array afb; ldafb ≥ 2*kl+ku+1.

ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains the pivot indices as computed by ?gbtrf; for 1 ≤ i ≤ n, row i of the matrix was interchanged with row ipiv(i).

r, c REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A. If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If equed = 'R' or 'B', each element of r must be positive. If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If equed = 'C' or 'B', each element of c must be positive. Each element of r or c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows.
Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.

ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).

x REAL for sgbrfsx, DOUBLE PRECISION for dgbrfsx, COMPLEX for cgbrfsx, DOUBLE COMPLEX for zgbrfsx. Array, DIMENSION (ldx,*). The solution matrix X as computed by sgbtrs/dgbtrs for real flavors or cgbtrs/zgbtrs for complex flavors.

ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See the err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.

nparams INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.

params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry will be filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.

params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)

params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.

params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).

iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.

rwork REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters

x REAL for sgbrfsx, DOUBLE PRECISION for dgbrfsx, COMPLEX for cgbrfsx, DOUBLE COMPLEX for zgbrfsx. The improved solution matrix X.

rcond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.

berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

err_bnds_norm REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:

Normwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / max_j |x(j,i)|

The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:

err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.

err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.

err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.

err_bnds_comp REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:

Componentwise relative error in the i-th solution vector: max_j ( |xtrue(j,i) - x(j,i)| / |x(j,i)| )

The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:

err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.

err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.

err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
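The layout of err_bnds_norm and err_bnds_comp described above can be read back with a few lines of C. The sketch below assumes the arrays are stored column by column as in the Fortran interface (element (i, err) at offset (i-1) + (err-1)*nrhs) and that a trust flag of 0.0 means "don't trust"; get_trusted_bound and the enum names are hypothetical, and the layout seen through the C interface with row-major ordering may differ.

```c
#include <assert.h>

/* Hypothetical 0-based field positions for err = 1, 2, 3. */
enum { ERR_TRUST = 0, ERR_BOUND = 1, ERR_RCOND = 2 };

/* Hypothetical helper: fetch the error bound for right-hand side i
   (0-based) from a column-major (nrhs x n_err_bnds) array.
   Returns 1 and stores the bound in *bound when the bound is present
   and flagged as trustworthy; returns 0 and leaves *bound unchanged
   otherwise. */
int get_trusted_bound(const double *err_bnds, int nrhs, int n_err_bnds,
                      int i, double *bound)
{
    assert(i >= 0 && i < nrhs);
    if (n_err_bnds < 2)
        return 0;                                  /* no bound was returned */
    if (err_bnds[i + ERR_TRUST * nrhs] == 0.0)
        return 0;                                  /* flagged "don't trust" */
    *bound = err_bnds[i + ERR_BOUND * nrhs];
    return 1;
}
```

A caller would typically check every right-hand side this way, because info reports only the first right-hand side whose bound is not guaranteed.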
params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.

info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed. If info = -i, the i-th parameter had an illegal value. If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?gtrfs
Refines the solution of a system of linear equations with a tridiagonal matrix and estimates its error.
Syntax Fortran 77: call sgtrfs( trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info ) call dgtrfs( trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info ) call cgtrfs( trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info ) call zgtrfs( trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info ) Fortran 95: call gtrfs( dl, d, du, dlf, df, duf, du2, ipiv, b, x [,trans] [,ferr] [,berr] [,info] ) C: lapack_int LAPACKE_sgtrfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const float* dl, const float* d, const float* du, const float* dlf, const float* df, const float* duf, const float* du2, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr ); lapack_int LAPACKE_dgtrfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const double* dl, const double* d, const double* du, const double* dlf, const double* df, const double* duf, const double* du2, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr ); lapack_int LAPACKE_cgtrfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_float* dl, const lapack_complex_float* d, const lapack_complex_float* du, const lapack_complex_float* dlf, const lapack_complex_float* df, const lapack_complex_float* duf, const lapack_complex_float* du2, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr ); lapack_int LAPACKE_zgtrfs( int matrix_order, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_double* dl, const lapack_complex_double* d, const lapack_complex_double* du, const lapack_complex_double* dlf, const lapack_complex_double* df, const lapack_complex_double* duf, const lapack_complex_double* du2, 
const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a tridiagonal matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
$|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine:
• call the factorization routine ?gttrf
• call the solver routine ?gttrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations:
If trans = 'N', the system has the form A*X = B.
If trans = 'T', the system has the form A^T*X = B.
If trans = 'C', the system has the form A^H*X = B.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B; nrhs ≥ 0.
dl, d, du, dlf, df, duf, du2, b, x, work REAL for sgtrfs, DOUBLE PRECISION for dgtrfs, COMPLEX for cgtrfs, DOUBLE COMPLEX for zgtrfs.
Arrays:
dl, dimension (n-1), contains the subdiagonal elements of A.
d, dimension (n), contains the diagonal elements of A.
du, dimension (n-1), contains the superdiagonal elements of A.
dlf, dimension (n-1), contains the (n-1) multipliers that define the matrix L from the LU factorization of A as computed by ?gttrf.
df, dimension (n), contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
duf, dimension (n-1), contains the (n-1) elements of the first superdiagonal of U.
du2, dimension (n-2), contains the (n-2) elements of the second superdiagonal of U.
b(ldb,nrhs) contains the right-hand side matrix B.
x(ldx,nrhs) contains the solution matrix X, as computed by ?gttrs.
work(*) is a workspace array; the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?gttrf.
iwork INTEGER. Workspace array, DIMENSION (n). Used for real flavors only.
rwork REAL for cgtrfs, DOUBLE PRECISION for zgtrfs. Workspace array, DIMENSION (n). Used for complex flavors only.

Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gtrfs interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
dlf Holds the vector of length (n-1).
df Holds the vector of length n.
duf Holds the vector of length (n-1).
du2 Holds the vector of length (n-2).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
trans Must be 'N', 'C', or 'T'. The default value is 'N'.

?porfs
Refines the solution of a system of linear equations with a symmetric (Hermitian) positive-definite matrix and estimates its error.

Syntax
Fortran 77:
call sporfs( uplo, n, nrhs, a, lda, af, ldaf, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dporfs( uplo, n, nrhs, a, lda, af, ldaf, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cporfs( uplo, n, nrhs, a, lda, af, ldaf, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zporfs( uplo, n, nrhs, a, lda, af, ldaf, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call porfs( a, af, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_sporfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dporfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cporfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zporfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const
lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a symmetric (Hermitian) positive definite matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
$|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).
Before calling this routine:
• call the factorization routine ?potrf
• call the solver routine ?potrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, af, b, x, work REAL for sporfs, DOUBLE PRECISION for dporfs, COMPLEX for cporfs, DOUBLE COMPLEX for zporfs.
Arrays:
a(lda,*) contains the original matrix A, as supplied to ?potrf.
af(ldaf,*) contains the factored matrix A, as returned by ?potrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The second dimension of a and af must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cporfs, DOUBLE PRECISION for zporfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine porfs interface are as follows:
a Holds the matrix A of size (n,n).
af Holds the matrix AF of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of 4n^2 floating-point operations (for real flavors) or 16n^2 operations (for complex flavors).
In addition, each step of iterative refinement involves 6n^2 operations (for real flavors) or 24n^2 operations (for complex flavors); the number of iterations may range from 1 to 5. Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 2n^2 floating-point operations for real flavors or 8n^2 for complex flavors.

?porfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a symmetric/Hermitian positive-definite matrix A and provides error bounds and backward error estimates.

Syntax
Fortran 77:
call sporfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dporfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cporfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zporfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sporfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const float* s, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_dporfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const double* s, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* berr, lapack_int
n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );
lapack_int LAPACKE_cporfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const float* s, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zporfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const double* s, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine improves the computed solution to a system of linear equations and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed and s below. In this case, the solution and error bounds returned are for the original unequilibrated system.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
equed CHARACTER*1. Must be 'N' or 'Y'. Specifies the form of equilibration that was done to A before calling this routine.
If equed = 'N', no equilibration was done.
If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s). The right-hand side B has been changed accordingly.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sporfsx, DOUBLE PRECISION for dporfsx, COMPLEX for cporfsx, DOUBLE COMPLEX for zporfsx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the symmetric/Hermitian matrix A as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n).
The array af contains the triangular factor L or U from the Cholesky factorization A = U**T*U or A = L*L**T as computed by ?potrf.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'N', s is not accessed. If equed = 'Y', each element of s must be positive. Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
x REAL for sporfsx, DOUBLE PRECISION for dporfsx, COMPLEX for cporfsx, DOUBLE COMPLEX for zporfsx. Array, DIMENSION (ldx,*). The solution matrix X as computed by ?potrs.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See the err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.
params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nparams). Specifies algorithm parameters. If an entry is less than 0.0, that entry will be filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: Set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters
x REAL for sporfsx, DOUBLE PRECISION for dporfsx, COMPLEX for cporfsx, DOUBLE COMPLEX for zporfsx. The improved solution matrix X.
rcond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs).
Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: $\max_j |xtrue_{ji} - x_{ji}| / \max_j |x_{ji}|$, where xtrue denotes the exact solution.
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
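The threshold comparison described for the err=2 and err=3 fields can be sketched in C as follows. This is a minimal illustration, not MKL code: the helper name bound_is_guaranteed is hypothetical, and eps stands for the machine epsilon that ?lamch would return (a constant such as DBL_EPSILON from float.h is only a stand-in and may differ from the ?lamch value by a factor of two, depending on the rounding convention).

```c
#include <assert.h>

/* Hypothetical helper, not part of MKL.
   Returns 1 if the reciprocal condition number stored in the err=3 field
   is greater than the threshold sqrt(n)*eps, that is, if the bound in the
   err=2 field is described as "guaranteed"; returns 0 otherwise. */
int bound_is_guaranteed(double rcond, int n, double eps)
{
    assert(n >= 0 && eps > 0.0);
    if (rcond <= 0.0)
        return 0; /* singular to working precision: never guaranteed */
    /* rcond > sqrt(n)*eps, with both sides squared so that no math
       library call is needed; valid because both sides are non-negative. */
    return rcond * rcond > (double)n * eps * eps;
}
```

In the Fortran (column-major) layout of err_bnds_norm(nrhs, n_err_bnds), the err=3 entry for the i-th right-hand side (1-based) is at linear offset (i-1) + 2*nrhs, so a caller could pass that element as rcond.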
err_bnds_comp REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: $\max_j |xtrue_{ji} - x_{ji}| / |x_{ji}|$, where xtrue denotes the exact solution.
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
params REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.
info INTEGER.
If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), the j-th right-hand side is the first with either a normwise or a componentwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?pprfs
Refines the solution of a system of linear equations with a packed symmetric (Hermitian) positive-definite matrix and estimates its error.
Syntax
Fortran 77:
call spprfs( uplo, n, nrhs, ap, afp, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dpprfs( uplo, n, nrhs, ap, afp, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cpprfs( uplo, n, nrhs, ap, afp, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zpprfs( uplo, n, nrhs, ap, afp, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call pprfs( ap, afp, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_spprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* ap, const float* afp, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dpprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* ap, const double* afp, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cpprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, const lapack_complex_float* afp, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zpprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, const lapack_complex_double* afp, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a packed symmetric (Hermitian) positive definite matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
$|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$.
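To make the backward-error definition above concrete, the following sketch evaluates it with the standard componentwise (Oettli-Prager) formula, max over i of |b - A*x|_i / (|A|*|x| + |b|)_i, for a small dense matrix. It is an illustration under stated assumptions, not the packed-storage code used by ?pprfs: the function name and the dense row-major layout are hypothetical, and only real data is handled.

```c
#include <assert.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper, not part of MKL.
   Componentwise backward error of x as a solution of A*x = b, where A is
   an n-by-n dense matrix stored row-major in a[]. */
double componentwise_backward_error(int n, const double *a,
                                    const double *x, const double *b)
{
    double beta = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        double resid = b[i];        /* accumulates (b - A*x)_i       */
        double denom = abs_d(b[i]); /* accumulates (|A||x| + |b|)_i  */
        for (int j = 0; j < n; ++j) {
            resid -= a[i * n + j] * x[j];
            denom += abs_d(a[i * n + j]) * abs_d(x[j]);
        }
        /* A zero denominator implies a zero residual, so the row is skipped. */
        if (denom > 0.0) {
            double ratio = abs_d(resid) / denom;
            if (ratio > beta)
                beta = ratio;
        }
    }
    return beta;
}
```

For an exact solution the function returns 0; a value near the machine precision indicates that x solves a system very close to the original one.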
Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$, where $x_e$ is the exact solution.
Before calling this routine:
• call the factorization routine ?pptrf
• call the solver routine ?pptrs.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ap, afp, b, x, work REAL for spprfs, DOUBLE PRECISION for dpprfs, COMPLEX for cpprfs, DOUBLE COMPLEX for zpprfs.
Arrays:
ap(*) contains the original packed matrix A, as supplied to ?pptrf.
afp(*) contains the factored packed matrix A, as returned by ?pptrf.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1, n(n+1)/2); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cpprfs, DOUBLE PRECISION for zpprfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs).
Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pprfs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
afp Holds the array AF of size (n*(n+1)/2).
b Holds the matrix B of size (n, nrhs).
x Holds the matrix X of size (n, nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of 4n^2 floating-point operations (for real flavors) or 16n^2 operations (for complex flavors). In addition, each step of iterative refinement involves 6n^2 operations (for real flavors) or 24n^2 operations (for complex flavors); the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately 2n^2 floating-point operations for real flavors or 8n^2 for complex flavors.

?pbrfs
Refines the solution of a system of linear equations with a band symmetric (Hermitian) positive-definite matrix and estimates its error.
Syntax
Fortran 77:
call spbrfs( uplo, n, kd, nrhs, ab, ldab, afb, ldafb, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dpbrfs( uplo, n, kd, nrhs, ab, ldab, afb, ldafb, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call cpbrfs( uplo, n, kd, nrhs, ab, ldab, afb, ldafb, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zpbrfs( uplo, n, kd, nrhs, ab, ldab, afb, ldafb, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call pbrfs( ab, afb, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_spbrfs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const float* ab, lapack_int ldab, const float* afb, lapack_int ldafb, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dpbrfs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const double* ab, lapack_int ldab, const double* afb, lapack_int ldafb, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cpbrfs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const lapack_complex_float* ab, lapack_int ldab, const lapack_complex_float* afb, lapack_int ldafb, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zpbrfs( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, const lapack_complex_double* ab, lapack_int ldab, const lapack_complex_double* afb, lapack_int ldafb, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a symmetric (Hermitian) positive definite band matrix A, with
multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - xe||∞/||x||∞ (here xe is the exact solution).
Before calling this routine:
• call the factorization routine ?pbtrf
• call the solver routine ?pbtrs.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ab, afb, b, x, work REAL for spbrfs; DOUBLE PRECISION for dpbrfs; COMPLEX for cpbrfs; DOUBLE COMPLEX for zpbrfs. Arrays: ab(ldab,*) contains the original band matrix A, as supplied to ?pbtrf. afb(ldafb,*) contains the factored band matrix A, as returned by ?pbtrf. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array. The second dimension of ab and afb must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldab INTEGER. The leading dimension of ab; ldab ≥ kd + 1.
ldafb INTEGER. The leading dimension of afb; ldafb ≥ kd + 1.
ldb INTEGER.
The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for cpbrfs; DOUBLE PRECISION for zpbrfs. Workspace array, DIMENSION at least max(1, n).
Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbrfs interface are as follows:
ab Holds the array A of size (kd+1, n).
afb Holds the array AF of size (kd+1, n).
b Holds the matrix B of size (n, nrhs).
x Holds the matrix X of size (n, nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of 8n*kd floating-point operations (for real flavors) or 32n*kd operations (for complex flavors). In addition, each step of iterative refinement involves 12n*kd operations (for real flavors) or 48n*kd operations (for complex flavors); the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11.
Each solution requires approximately 4n*kd floating-point operations for real flavors or 16n*kd for complex flavors.

?ptrfs
Refines the solution of a system of linear equations with a symmetric (Hermitian) positive-definite tridiagonal matrix and estimates its error.
Syntax
Fortran 77:
call sptrfs( n, nrhs, d, e, df, ef, b, ldb, x, ldx, ferr, berr, work, info )
call dptrfs( n, nrhs, d, e, df, ef, b, ldb, x, ldx, ferr, berr, work, info )
call cptrfs( uplo, n, nrhs, d, e, df, ef, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zptrfs( uplo, n, nrhs, d, e, df, ef, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call ptrfs( d, df, e, ef, b, x [,ferr] [,berr] [,info] )
call ptrfs( d, df, e, ef, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_sptrfs( int matrix_order, lapack_int n, lapack_int nrhs, const float* d, const float* e, const float* df, const float* ef, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dptrfs( int matrix_order, lapack_int n, lapack_int nrhs, const double* d, const double* e, const double* df, const double* ef, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_cptrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* d, const lapack_complex_float* e, const float* df, const lapack_complex_float* ef, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zptrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* d, const lapack_complex_double* e, const double* df, const lapack_complex_double* ef, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine
performs an iterative refinement of the solution to a system of linear equations A*X = B with a symmetric (Hermitian) positive definite tridiagonal matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - xe||∞/||x||∞ (here xe is the exact solution).
Before calling this routine:
• call the factorization routine ?pttrf
• call the solver routine ?pttrs.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Used for complex flavors only. Must be 'U' or 'L'. Specifies whether the superdiagonal or the subdiagonal of the tridiagonal matrix A is stored and how A is factored: If uplo = 'U', the array e stores the superdiagonal of A, and A is factored as U^H*D*U. If uplo = 'L', the array e stores the subdiagonal of A, and A is factored as L*D*L^H.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
d, df, rwork REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays: d(n), df(n), rwork(n). The array d contains the n diagonal elements of the tridiagonal matrix A. The array df contains the n diagonal elements of the diagonal matrix D from the factorization of A as computed by ?pttrf. The array rwork is a workspace array used for complex flavors only.
e, ef, b, x, work REAL for sptrfs; DOUBLE PRECISION for dptrfs; COMPLEX for cptrfs; DOUBLE COMPLEX for zptrfs.
Arrays: e(n-1), ef(n-1), b(ldb,nrhs), x(ldx,nrhs), work(*). The array e contains the (n - 1) off-diagonal elements of the tridiagonal matrix A (see uplo). The array ef contains the (n - 1) off-diagonal elements of the unit bidiagonal factor U or L from the factorization computed by ?pttrf (see uplo). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The array x contains the solution matrix X as computed by ?pttrs. The array work is a workspace array. The dimension of work must be at least 2*n for real flavors, and at least n for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ptrfs interface are as follows:
d Holds the vector of length n.
df Holds the vector of length n.
e Holds the vector of length (n-1).
ef Holds the vector of length (n-1).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Used in complex flavors only. Must be 'U' or 'L'. The default value is 'U'.

?syrfs
Refines the solution of a system of linear equations with a symmetric matrix and estimates its error.
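The componentwise backward error that this routine reports in berr can be illustrated on a small dense real system. The sketch below is an illustration under stated assumptions, not the library's implementation: it uses the standard characterization β = max_i |b - A*x|_i / (|A|*|x| + |b|)_i, reads only the upper triangle of a column-major array (as with uplo = 'U'), and simply skips rows whose denominator is zero, whereas a production routine guards such cases more carefully. The names backward_error_upper and abs_d are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper: componentwise backward error of a computed
 * solution x of A*x = b,
 *     beta = max_i |b - A*x|_i / (|A|*|x| + |b|)_i .
 * a is a dense symmetric n-by-n matrix, column-major, leading dimension
 * lda; only its upper triangle is read. Rows with a zero denominator
 * are skipped (a simplification made for this sketch).
 */
double backward_error_upper(size_t n, const double *a, size_t lda,
                            const double *x, const double *b)
{
    assert(lda >= (n > 0 ? n : 1));
    double beta = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = b[i];         /* accumulates the residual b_i - (A*x)_i */
        double d = abs_d(b[i]);  /* accumulates |b_i| + (|A|*|x|)_i        */
        for (size_t j = 0; j < n; ++j) {
            /* A(i,j) = A(j,i): always fetch from the upper triangle */
            double aij = (i <= j) ? a[i + j * lda] : a[j + i * lda];
            r -= aij * x[j];
            d += abs_d(aij) * abs_d(x[j]);
        }
        if (d > 0.0) {
            double q = abs_d(r) / d;
            if (q > beta) beta = q;
        }
    }
    return beta;
}
```

A value of β near the machine precision indicates that x solves a system whose entries differ from those of A and b only at rounding-error level.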
Syntax
Fortran 77:
call ssyrfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dsyrfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call csyrfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zsyrfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call syrfs( a, af, ipiv, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_ssyrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dsyrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_csyrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zsyrfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a symmetric full-storage matrix A, with multiple right-hand
sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - xe||∞/||x||∞ (here xe is the exact solution).
Before calling this routine:
• call the factorization routine ?sytrf
• call the solver routine ?sytrs.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
a, af, b, x, work REAL for ssyrfs; DOUBLE PRECISION for dsyrfs; COMPLEX for csyrfs; DOUBLE COMPLEX for zsyrfs. Arrays: a(lda,*) contains the original matrix A, as supplied to ?sytrf. af(ldaf,*) contains the factored matrix A, as returned by ?sytrf. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array. The second dimension of a and af must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n).
The ipiv array, as returned by ?sytrf.
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for csyrfs; DOUBLE PRECISION for zsyrfs. Workspace array, DIMENSION at least max(1, n).
Output Parameters
x The refined solution matrix X.
ferr, berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine syrfs interface are as follows:
a Holds the matrix A of size (n,n).
af Holds the matrix AF of size (n,n).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
For each right-hand side, computation of the backward error involves a minimum of 4n^2 floating-point operations (for real flavors) or 16n^2 operations (for complex flavors). In addition, each step of iterative refinement involves 6n^2 operations (for real flavors) or 24n^2 operations (for complex flavors); the number of iterations may range from 1 to 5.
Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11.
Each solution requires approximately 2n^2 floating-point operations for real flavors or 8n^2 for complex flavors.

?syrfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a symmetric indefinite matrix A and provides error bounds and backward error estimates.
Syntax
Fortran 77:
call ssyrfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dsyrfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call csyrfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zsyrfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_ssyrfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* af, lapack_int ldaf, const lapack_int* ipiv, const float* s, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_dsyrfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* af, lapack_int ldaf, const lapack_int* ipiv, const double* s, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );
lapack_int LAPACKE_csyrfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af,
lapack_int ldaf, const lapack_int* ipiv, const float* s, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zsyrfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const double* s, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine improves the computed solution to a system of linear equations when the coefficient matrix is symmetric indefinite, and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds.
The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed and s below. In this case, the solution and error bounds returned are for the original unequilibrated system.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
equed CHARACTER*1. Must be 'N' or 'Y'. Specifies the form of equilibration that was done to A before calling this routine. If equed = 'N', no equilibration was done. If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s). The right-hand side B has been changed accordingly.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for ssyrfsx; DOUBLE PRECISION for dsyrfsx; COMPLEX for csyrfsx; DOUBLE COMPLEX for zsyrfsx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*). The array a contains the symmetric matrix A as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n). The array af contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T as computed by ?sytrf. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs). work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n).
Contains details of the interchanges and the block structure of D as determined by ?sytrf.
s REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'N', s is not accessed. If equed = 'Y', each element of s must be positive. Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
x REAL for ssyrfsx; DOUBLE PRECISION for dsyrfsx; COMPLEX for csyrfsx; DOUBLE COMPLEX for zsyrfsx. Array, DIMENSION (ldx,*). The solution matrix X as computed by ?sytrs.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See the err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nparams). Specifies algorithm parameters. If an entry is less than 0.0, that entry will be filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not.
Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
Output Parameters
x REAL for ssyrfsx; DOUBLE PRECISION for dsyrfsx; COMPLEX for csyrfsx; DOUBLE COMPLEX for zsyrfsx. The improved solution matrix X.
rcond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / max_j |x(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / |x(j,i)|
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for single precision flavors and sqrt(n)*dlamch(ε) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
params REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), the j-th right-hand side is the first with either a normwise or componentwise error bound that is not guaranteed (the smallest j such that either err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?herfs
Refines the solution of a system of linear equations with a complex Hermitian matrix and estimates its error.
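As a rough guide to the cost of this routine, the operation counts quoted in its Application Notes (a minimum of 16n^2 for the backward error, 24n^2 per refinement step, and about 8n^2 per forward-error solve) can be combined into a single estimate per right-hand side. The C sketch below does only that arithmetic; the function name herfs_flops_estimate and its argument checks are hypothetical and are not part of the library.

```c
#include <assert.h>

/* Hypothetical helper: rough floating-point operation count for one
 * right-hand side of ?herfs, combining the per-phase figures from the
 * Application Notes:
 *     backward error           : 16*n^2 (a minimum)
 *     each refinement step     : 24*n^2
 *     each forward-error solve :  8*n^2
 * iters is the number of refinement steps (1 to 5) and solves is the
 * number of systems solved for the forward error (never more than 11).
 */
unsigned long long herfs_flops_estimate(unsigned long long n,
                                        unsigned iters, unsigned solves)
{
    assert(iters >= 1 && iters <= 5);
    assert(solves <= 11);
    unsigned long long n2 = n * n;
    return 16ULL * n2 + 24ULL * n2 * iters + 8ULL * n2 * solves;
}
```

With n = 100, one refinement step and four forward-error solves, the estimate is 720000 operations, so for small iteration counts the forward-error solves account for a large share of the total.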
Syntax
Fortran 77:
call cherfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zherfs( uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call herfs( a, af, ipiv, b, x [,uplo] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_cherfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zherfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a complex Hermitian full-storage matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system:
|δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
Finally, the routine estimates the component-wise forward error in the computed solution ||x - xe||∞/||x||∞ (here xe is the exact solution).
Before calling this routine:
• call the factorization routine ?hetrf
• call the solver routine ?hetrs.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.

n INTEGER. The order of the matrix A; n ≥ 0.

nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.

a,af,b,x,work COMPLEX for cherfs DOUBLE COMPLEX for zherfs. Arrays: a(lda,*) contains the original matrix A, as supplied to ?hetrf. af(ldaf,*) contains the factored matrix A, as returned by ?hetrf. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array. The second dimension of a and af must be at least max(1, n); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 2*n).

lda INTEGER. The leading dimension of a; lda ≥ max(1, n).

ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).

ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).

ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hetrf.

rwork REAL for cherfs DOUBLE PRECISION for zherfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters

x The refined solution matrix X.

ferr, berr REAL for cherfs DOUBLE PRECISION for zherfs. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine herfs interface are as follows:

a Holds the matrix A of size (n,n).
af Holds the matrix AF of size (n,n).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error. For each right-hand side, computation of the backward error involves a minimum of 16n^2 operations. In addition, each step of iterative refinement involves 24n^2 operations; the number of iterations may range from 1 to 5. Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 8n^2 floating-point operations. The real counterpart of this routine is ssyrfs/dsyrfs.

?herfsx
Uses extra precise iterative refinement to improve the solution to the system of linear equations with a Hermitian indefinite matrix A and provides error bounds and backward error estimates.
Syntax

Fortran 77:
call cherfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zherfsx( uplo, equed, n, nrhs, a, lda, af, ldaf, ipiv, s, b, ldb, x, ldx, rcond, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )

C:
lapack_int LAPACKE_cherfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* af, lapack_int ldaf, const lapack_int* ipiv, const float* s, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, float* params );
lapack_int LAPACKE_zherfsx( int matrix_order, char uplo, char equed, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* af, lapack_int ldaf, const lapack_int* ipiv, const double* s, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine improves the computed solution to a system of linear equations when the coefficient matrix is Hermitian indefinite, and provides error bounds and backward error estimates for the solution. In addition to a normwise error bound, the code provides a maximum componentwise error bound, if possible. See comments for err_bnds_norm and err_bnds_comp for details of the error bounds. The original system of linear equations may have been equilibrated before calling this routine, as described by the parameters equed and s below.
In this case, the solution and error bounds returned are for the original unequilibrated system.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.

equed CHARACTER*1. Must be 'N' or 'Y'. Specifies the form of equilibration that was done to A before calling this routine. If equed = 'N', no equilibration was done. If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s). The right-hand side B has been changed accordingly.

n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.

nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.

a, af, b, work COMPLEX for cherfsx DOUBLE COMPLEX for zherfsx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*). The array a contains the Hermitian matrix A as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n). The array af contains the factored form of the matrix A: the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**H or A = L*D*L**H as computed by chetrf for cherfsx or zhetrf for zherfsx.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs). work(*) is a workspace array. The dimension of work must be at least max(1,2*n).

lda INTEGER. The leading dimension of a; lda ≥ max(1, n).

ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).

ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D as determined by chetrf for cherfsx or zhetrf for zherfsx.

s REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'N', s is not accessed. If equed = 'Y', each element of s must be positive. Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.

ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).

x COMPLEX for cherfsx DOUBLE COMPLEX for zherfsx. Array, DIMENSION (ldx,*). The solution matrix X as computed by ?hetrs.

ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.

nparams INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.

params REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters.
If an entry is less than 0.0, that entry will be filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.

params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for cherfsx), 1.0D+0 (for zherfsx).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)

params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.

params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).

rwork REAL for cherfsx DOUBLE PRECISION for zherfsx. Workspace array, DIMENSION at least max(1, 3*n).

Output Parameters

x COMPLEX for cherfsx DOUBLE COMPLEX for zherfsx. The improved solution matrix X.

rcond REAL for cherfsx DOUBLE PRECISION for zherfsx. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision.
Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.

berr REAL for cherfsx DOUBLE PRECISION for zherfsx. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

err_bnds_norm REAL for cherfsx DOUBLE PRECISION for zherfsx. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:

Normwise relative error in the i-th solution vector: $\max_j |x^{true}_{ji} - x_{ji}| \,/\, \max_j |x_{ji}|$

The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:

err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch(ε) for cherfsx and sqrt(n)*dlamch(ε) for zherfsx.

err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for cherfsx and sqrt(n)*dlamch(ε) for zherfsx. This error bound should only be trusted if the previous boolean is true.

err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for cherfsx and sqrt(n)*dlamch(ε) for zherfsx to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(z^-1,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
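The comparison of a reciprocal condition number with the threshold sqrt(n)*eps can be sketched in a few lines of C. This is only an illustration: the function name rcond_passes_threshold is hypothetical and not part of Intel MKL, eps is assumed to be supplied by the caller (for example, the value returned by slamch or dlamch), and treating equality with the threshold as passing is an assumption.

```c
#include <assert.h>

/* Hypothetical helper, not an MKL routine.
   Returns 1 when the reciprocal condition number estimate rcond is at
   least sqrt(n)*eps, and 0 otherwise. eps is supplied by the caller.
   Both sides are squared, so no math-library call is needed. */
int rcond_passes_threshold(double rcond, int n, double eps)
{
    assert(rcond >= 0.0 && n >= 0 && eps > 0.0);
    /* For non-negative values: rcond >= sqrt(n)*eps  <=>  rcond^2 >= n*eps^2 */
    return rcond * rcond >= (double)n * eps * eps;
}
```

A caller would pass the err=3 field of the right-hand side of interest as rcond, together with the matrix order n.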
err_bnds_comp REAL for cherfsx DOUBLE PRECISION for zherfsx. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:

Componentwise relative error in the i-th solution vector: $\max_j \left( |x^{true}_{ji} - x_{ji}| \,/\, |x_{ji}| \right)$

The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:

err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is not less than the threshold sqrt(n)*slamch(ε) for cherfsx and sqrt(n)*dlamch(ε) for zherfsx.

err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(ε) for cherfsx and sqrt(n)*dlamch(ε) for zherfsx. This error bound should only be trusted if the previous boolean is true.

err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(ε) for cherfsx and sqrt(n)*dlamch(ε) for zherfsx to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(z^-1,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
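Reading the three fields of err_bnds_norm or err_bnds_comp might look like the following sketch. It assumes the array is stored column by column with dimensions (nrhs, n_err_bnds), as in the Fortran interface, and uses double precision; the names err_field, bound_is_trusted, and first_untrusted_rhs are hypothetical helpers, not MKL functions.

```c
#include <assert.h>
#include <stddef.h>

/* 1-based field indices, matching err=1, err=2, err=3 in the text. */
enum { ERR_TRUST = 1, ERR_BOUND = 2, ERR_RCOND = 3 };

/* Element (i, err) of an array with nrhs rows, stored column by column
   (assumed Fortran layout). Both indices are 1-based. */
double err_field(const double *err_bnds, int nrhs, int i, int err)
{
    assert(i >= 1 && i <= nrhs && err >= 1);
    return err_bnds[(size_t)(i - 1) + (size_t)(err - 1) * (size_t)nrhs];
}

/* Returns 1 if the "trust" field for right-hand side i is nonzero. */
int bound_is_trusted(const double *err_bnds, int nrhs, int i)
{
    return err_field(err_bnds, nrhs, i, ERR_TRUST) != 0.0;
}

/* Returns the first right-hand side (1-based) whose trust field is 0.0,
   or 0 when every right-hand side is flagged as trusted. */
int first_untrusted_rhs(const double *err_bnds, int nrhs)
{
    for (int i = 1; i <= nrhs; ++i) {
        if (!bound_is_trusted(err_bnds, nrhs, i)) {
            return i;
        }
    }
    return 0;
}
```

The array passed in must hold at least the first field for every right-hand side (n_err_bnds ≥ 1); reading ERR_BOUND or ERR_RCOND requires n_err_bnds of at least 2 or 3, respectively.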
params REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Output parameter only if the input contains erroneous values, namely, in params(1), params(2), params(3). In such a case, the corresponding elements of params are filled with default values on output.

info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed. If info = -i, the i-th parameter had an illegal value. If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), it is the first right-hand side j for which err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0. See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?sprfs
Refines the solution of a system of linear equations with a packed symmetric matrix and estimates the solution error.
Syntax

Fortran 77:
call ssprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dsprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call csprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zsprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )

Fortran 95:
call sprfs( ap, afp, ipiv, b, x [,uplo] [,ferr] [,berr] [,info] )

C:
lapack_int LAPACKE_ssprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const float* ap, const float* afp, const lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dsprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const double* ap, const double* afp, const lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_csprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, const lapack_complex_float* afp, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zsprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, const lapack_complex_double* afp, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a packed symmetric matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β.
This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: $|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$. Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).

Before calling this routine:
• call the factorization routine ?sptrf
• call the solver routine ?sptrs.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.

n INTEGER. The order of the matrix A; n ≥ 0.

nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.

ap,afp,b,x,work REAL for ssprfs DOUBLE PRECISION for dsprfs COMPLEX for csprfs DOUBLE COMPLEX for zsprfs. Arrays: ap(*) contains the original packed matrix A, as supplied to ?sptrf. afp(*) contains the factored packed matrix A, as returned by ?sptrf. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array. The dimension of arrays ap and afp must be at least max(1, n(n+1)/2); the second dimension of b and x must be at least max(1, nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.

ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).

ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sptrf.

iwork INTEGER. Workspace array, DIMENSION at least max(1, n).

rwork REAL for csprfs DOUBLE PRECISION for zsprfs. Workspace array, DIMENSION at least max(1, n).
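The minimum length max(1, n(n+1)/2) of ap and afp, and the position of a given matrix element inside a packed array, can be illustrated with the short C sketch below. It assumes the standard LAPACK column-wise packed layout of the Fortran interface, which this section does not restate; the helper names packed_len and packed_index are hypothetical and not part of Intel MKL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: minimum number of elements in ap or afp for a
   matrix of order n, that is, max(1, n(n+1)/2). */
size_t packed_len(int n)
{
    assert(n >= 0);
    size_t len = (size_t)n * ((size_t)n + 1) / 2;
    return len > 0 ? len : 1;
}

/* Hypothetical helper: 0-based offset in the packed array of element
   (i, j), with 1-based i and j, assuming column-wise packing.
   For uplo = 'U' the element must satisfy i <= j;
   for uplo = 'L' it must satisfy i >= j. */
size_t packed_index(char uplo, int n, int i, int j)
{
    if (uplo == 'U') {
        assert(1 <= i && i <= j && j <= n);
        return (size_t)(i - 1) + (size_t)j * (size_t)(j - 1) / 2;
    }
    assert(uplo == 'L' && 1 <= j && j <= i && i <= n);
    return (size_t)(i - 1) + (size_t)(j - 1) * (size_t)(2 * n - j) / 2;
}
```

For n = 3 and uplo = 'U', for example, the six stored elements appear in the order a(1,1), a(1,2), a(2,2), a(1,3), a(2,3), a(3,3).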
Output Parameters

x The refined solution matrix X.

ferr, berr REAL for single precision flavors. DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sprfs interface are as follows:

ap Holds the array A of size (n*(n+1)/2).
afp Holds the array AF of size (n*(n+1)/2).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error. For each right-hand side, computation of the backward error involves a minimum of 4n^2 floating-point operations (for real flavors) or 16n^2 operations (for complex flavors). In addition, each step of iterative refinement involves 6n^2 operations (for real flavors) or 24n^2 operations (for complex flavors); the number of iterations may range from 1 to 5. Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately 2n^2 floating-point operations for real flavors or 8n^2 for complex flavors.

?hprfs
Refines the solution of a system of linear equations with a packed complex Hermitian matrix and estimates the solution error.
Syntax

Fortran 77:
call chprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call zhprfs( uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, ferr, berr, work, rwork, info )

Fortran 95:
call hprfs( ap, afp, ipiv, b, x [,uplo] [,ferr] [,berr] [,info] )

C:
lapack_int LAPACKE_chprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, const lapack_complex_float* afp, const lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_zhprfs( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, const lapack_complex_double* afp, const lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine performs an iterative refinement of the solution to a system of linear equations A*X = B with a packed complex Hermitian matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: $|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$. Finally, the routine estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).

Before calling this routine:
• call the factorization routine ?hptrf
• call the solver routine ?hptrs.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.

n INTEGER. The order of the matrix A; n ≥ 0.

nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.

ap,afp,b,x,work COMPLEX for chprfs DOUBLE COMPLEX for zhprfs. Arrays: ap(*) contains the original packed matrix A, as supplied to ?hptrf. afp(*) contains the factored packed matrix A, as returned by ?hptrf. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array. The dimension of arrays ap and afp must be at least max(1, n(n+1)/2); the second dimension of b and x must be at least max(1,nrhs); the dimension of work must be at least max(1, 2*n).

ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).

ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hptrf.

rwork REAL for chprfs DOUBLE PRECISION for zhprfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters

x The refined solution matrix X.

ferr, berr REAL for chprfs. DOUBLE PRECISION for zhprfs. Arrays, DIMENSION at least max(1,nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hprfs interface are as follows:

ap Holds the array A of size (n*(n+1)/2).
afp Holds the array AF of size (n*(n+1)/2).
ipiv Holds the vector of length n.
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error. For each right-hand side, computation of the backward error involves a minimum of 16n^2 operations. In addition, each step of iterative refinement involves 24n^2 operations; the number of iterations may range from 1 to 5. Estimating the forward error involves solving a number of systems of linear equations A*x = b; the number is usually 4 or 5 and never more than 11. Each solution requires approximately 8n^2 floating-point operations. The real counterpart of this routine is ssprfs/dsprfs.

?trrfs
Estimates the error in the solution of a system of linear equations with a triangular matrix.

Syntax

Fortran 77:
call strrfs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dtrrfs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call ctrrfs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call ztrrfs( uplo, trans, diag, n, nrhs, a, lda, b, ldb, x, ldx, ferr, berr, work, rwork, info )

Fortran 95:
call trrfs( a, b, x [,uplo] [,trans] [,diag] [,ferr] [,berr] [,info] )

C:
lapack_int LAPACKE_strrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, const float* b, lapack_int ldb, const float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dtrrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, const double* b, lapack_int ldb, const double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_ctrrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n,
lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, const lapack_complex_float* b, lapack_int ldb, const lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_ztrrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, const lapack_complex_double* b, lapack_int ldb, const lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine estimates the errors in the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a triangular matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: $|\delta a_{ij}| \le \beta |a_{ij}|$, $|\delta b_i| \le \beta |b_i|$ such that $(A + \delta A)x = (b + \delta b)$. The routine also estimates the component-wise forward error in the computed solution $\|x - x_e\|_\infty / \|x\|_\infty$ (here $x_e$ is the exact solution).

Before calling this routine, call the solver routine ?trtrs.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular.

trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', the system has the form A*X = B. If trans = 'T', the system has the form A^T*X = B. If trans = 'C', the system has the form A^H*X = B.

diag CHARACTER*1.
Must be 'N' or 'U'. If diag = 'N', then A is not a unit triangular matrix. If diag = 'U', then A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array a.

n INTEGER. The order of the matrix A; n ≥ 0.

nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.

a, b, x, work REAL for strrfs DOUBLE PRECISION for dtrrfs COMPLEX for ctrrfs DOUBLE COMPLEX for ztrrfs. Arrays: a(lda,*) contains the upper or lower triangular matrix A, as specified by uplo. b(ldb,*) contains the right-hand side matrix B. x(ldx,*) contains the solution matrix X. work(*) is a workspace array. The second dimension of a must be at least max(1,n); the second dimension of b and x must be at least max(1,nrhs); the dimension of work must be at least max(1,3*n) for real flavors and max(1,2*n) for complex flavors.

lda INTEGER. The leading dimension of a; lda ≥ max(1, n).

ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).

iwork INTEGER. Workspace array, DIMENSION at least max(1, n).

rwork REAL for ctrrfs DOUBLE PRECISION for ztrrfs. Workspace array, DIMENSION at least max(1, n).

Output Parameters

ferr, berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and backward errors, respectively, for each solution vector.

info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine trrfs interface are as follows:

a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
A call to this routine involves, for each right-hand side, solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately n^2 floating-point operations for real flavors or 4n^2 for complex flavors.

?tprfs
Estimates the error in the solution of a system of linear equations with a packed triangular matrix.
Syntax
Fortran 77:
call stprfs( uplo, trans, diag, n, nrhs, ap, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dtprfs( uplo, trans, diag, n, nrhs, ap, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call ctprfs( uplo, trans, diag, n, nrhs, ap, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call ztprfs( uplo, trans, diag, n, nrhs, ap, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call tprfs( ap, b, x [,uplo] [,trans] [,diag] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_stprfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const float* ap, const float* b, lapack_int ldb, const float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dtprfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const double* ap, const double* b, lapack_int ldb, const double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_ctprfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, const lapack_complex_float* b, lapack_int ldb, const lapack_complex_float* x, lapack_int ldx, float* ferr, float*
berr );
lapack_int LAPACKE_ztprfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, const lapack_complex_double* b, lapack_int ldb, const lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine estimates the errors in the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a packed triangular matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: |δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
The routine also estimates the component-wise forward error in the computed solution ||x - x_e||_∞ / ||x||_∞ (here x_e is the exact solution).
Before calling this routine, call the solver routine ?tptrs.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', the system has the form A*X = B. If trans = 'T', the system has the form A^T*X = B. If trans = 'C', the system has the form A^H*X = B.
diag CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', A is not a unit triangular matrix. If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array ap.
n INTEGER.
The order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; nrhs ≥ 0.
ap, b, x, work REAL for stprfs
DOUBLE PRECISION for dtprfs
COMPLEX for ctprfs
DOUBLE COMPLEX for ztprfs.
Arrays:
ap(*) contains the upper or lower triangular matrix A, as specified by uplo.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The dimension of ap must be at least max(1,n(n+1)/2); the second dimension of b and x must be at least max(1,nrhs); the dimension of work must be at least max(1,3*n) for real flavors and max(1,2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for ctprfs
DOUBLE PRECISION for ztprfs.
Workspace array, DIMENSION at least max(1, n).
Output Parameters
ferr, berr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Arrays, DIMENSION at least max(1, nrhs). Contain the component-wise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tprfs interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
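To make the packed array ap and the component-wise backward error more concrete, the following is a minimal sketch in plain C; it does not call Intel MKL. It assumes the usual column-major (Fortran) packed layout, in which the upper triangle is stored column by column, and it evaluates the standard component-wise backward error formula, max over i of |b - A*x|_i / (|A|*|x| + |b|)_i, for a single right-hand side with a non-unit upper triangular A. The function names are hypothetical, and the safeguards that a library routine applies to very small denominators are omitted.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* 0-based offset of A(i,j), i <= j, in column-major upper packed storage
   (assumed layout: columns of the upper triangle stored one after another). */
size_t packed_upper_index(size_t i, size_t j)
{
    assert(i <= j);
    return i + j * (j + 1) / 2;
}

/* 0-based offset of A(i,j), i >= j, in column-major lower packed storage. */
size_t packed_lower_index(size_t i, size_t j, size_t n)
{
    assert(j <= i && i < n);
    return i + j * (2 * n - j - 1) / 2;
}

/* Component-wise backward error of x for A*x = b, where A is a non-unit
   upper triangular matrix held in packed form in ap (n*(n+1)/2 entries).
   Simplified illustration: rows with a zero denominator are skipped. */
double backward_error_upper_packed(size_t n, const double *ap,
                                   const double *x, const double *b)
{
    double beta = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double residual = b[i];        /* accumulates (b - A*x)_i        */
        double scale = fabs(b[i]);     /* accumulates (|A|*|x| + |b|)_i  */
        for (size_t j = i; j < n; ++j) {
            double aij = ap[packed_upper_index(i, j)];
            residual -= aij * x[j];
            scale += fabs(aij) * fabs(x[j]);
        }
        if (scale > 0.0 && fabs(residual) / scale > beta)
            beta = fabs(residual) / scale;
    }
    return beta;
}
```

For example, with n = 2 and the upper triangular matrix with rows (2, 1) and (0, 4), the packed array is {2, 1, 4}. A solution that satisfies the system exactly gives a backward error of zero, and perturbing one entry of b makes the value positive.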
Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
A call to this routine involves, for each right-hand side, solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately n^2 floating-point operations for real flavors or 4n^2 for complex flavors.

?tbrfs
Estimates the error in the solution of a system of linear equations with a triangular band matrix.
Syntax
Fortran 77:
call stbrfs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call dtbrfs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, x, ldx, ferr, berr, work, iwork, info )
call ctbrfs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, x, ldx, ferr, berr, work, rwork, info )
call ztbrfs( uplo, trans, diag, n, kd, nrhs, ab, ldab, b, ldb, x, ldx, ferr, berr, work, rwork, info )
Fortran 95:
call tbrfs( ab, b, x [,uplo] [,trans] [,diag] [,ferr] [,berr] [,info] )
C:
lapack_int LAPACKE_stbrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd, lapack_int nrhs, const float* ab, lapack_int ldab, const float* b, lapack_int ldb, const float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_dtbrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd, lapack_int nrhs, const double* ab, lapack_int ldab, const double* b, lapack_int ldb, const double* x, lapack_int ldx, double* ferr, double* berr );
lapack_int LAPACKE_ctbrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd, lapack_int nrhs, const lapack_complex_float* ab, lapack_int ldab, const lapack_complex_float* b, lapack_int ldb, const lapack_complex_float* x, lapack_int ldx, float* ferr, float* berr );
lapack_int LAPACKE_ztbrfs( int matrix_order, char uplo, char trans, char diag, lapack_int n, lapack_int kd,
lapack_int nrhs, const lapack_complex_double* ab, lapack_int ldab, const lapack_complex_double* b, lapack_int ldb, const lapack_complex_double* x, lapack_int ldx, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine estimates the errors in the solution to a system of linear equations A*X = B or A^T*X = B or A^H*X = B with a triangular band matrix A, with multiple right-hand sides. For each computed solution vector x, the routine computes the component-wise backward error β. This error is the smallest relative perturbation in elements of A and b such that x is the exact solution of the perturbed system: |δa_ij| ≤ β|a_ij|, |δb_i| ≤ β|b_i| such that (A + δA)x = (b + δb).
The routine also estimates the component-wise forward error in the computed solution ||x - x_e||_∞ / ||x||_∞ (here x_e is the exact solution).
Before calling this routine, call the solver routine ?tbtrs.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular.
trans CHARACTER*1. Must be 'N' or 'T' or 'C'. Indicates the form of the equations: If trans = 'N', the system has the form A*X = B. If trans = 'T', the system has the form A^T*X = B. If trans = 'C', the system has the form A^H*X = B.
diag CHARACTER*1. Must be 'N' or 'U'. If diag = 'N', A is not a unit triangular matrix. If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array ab.
n INTEGER. The order of the matrix A; n ≥ 0.
kd INTEGER. The number of super-diagonals or sub-diagonals in the matrix A; kd ≥ 0.
nrhs INTEGER.
The number of right-hand sides; nrhs ≥ 0.
ab, b, x, work REAL for stbrfs
DOUBLE PRECISION for dtbrfs
COMPLEX for ctbrfs
DOUBLE COMPLEX for ztbrfs.
Arrays:
ab(ldab,*) contains the upper or lower triangular matrix A, as specified by uplo, in band storage format.
b(ldb,*) contains the right-hand side matrix B.
x(ldx,*) contains the solution matrix X.
work(*) is a workspace array.
The second dimension of ab must be at least max(1,n); the second dimension of b and x must be at least max(1,nrhs). The dimension of work must be at least max(1,3*n) for real flavors and max(1,2*n) for complex flavors.
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kd + 1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n).
rwork REAL for ctbrfs
DOUBLE PRECISION for ztbrfs.
Workspace array, DIMENSION at least max(1, n).
Output Parameters
ferr, berr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Arrays, DIMENSION at least max(1, nrhs). Contain the component-wise forward and backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tbrfs interface are as follows:
ab Holds the array A of size (kd+1,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
diag Must be 'N' or 'U'. The default value is 'N'.
Application Notes
The bounds returned in ferr are not rigorous, but in practice they almost always overestimate the actual error.
A call to this routine involves, for each right-hand side, solving a number of systems of linear equations A*x = b; the number of systems is usually 4 or 5 and never more than 11. Each solution requires approximately 2n*kd floating-point operations for real flavors or 8n*kd operations for complex flavors.

Routines for Matrix Inversion
It is seldom necessary to compute an explicit inverse of a matrix. In particular, do not attempt to solve a system of equations Ax = b by first computing A^(-1) and then forming the matrix-vector product x = A^(-1)*b. Call a solver routine instead (see Routines for Solving Systems of Linear Equations); this is more efficient and more accurate.
However, matrix inversion routines are provided for the rare occasions when an explicit inverse matrix is needed.

?getri
Computes the inverse of an LU-factored general matrix.
Syntax
Fortran 77:
call sgetri( n, a, lda, ipiv, work, lwork, info )
call dgetri( n, a, lda, ipiv, work, lwork, info )
call cgetri( n, a, lda, ipiv, work, lwork, info )
call zgetri( n, a, lda, ipiv, work, lwork, info )
Fortran 95:
call getri( a, ipiv [,info] )
C:
lapack_int LAPACKE_<?>getri( int matrix_order, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the inverse inv(A) of a general matrix A. Before calling this routine, call ?getrf to factorize A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for sgetri
DOUBLE PRECISION for dgetri
COMPLEX for cgetri
DOUBLE COMPLEX for zgetri.
Arrays: a(lda,*), work(*).
a(lda,*) contains the factorization of the matrix A, as returned by ?getrf: A = P*L*U. The second dimension of a must be at least max(1,n).
work(*) is a workspace array of dimension at least max(1,lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?getrf.
lwork INTEGER. The size of the work array; lwork ≥ n. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for the suggested value of lwork.
Output Parameters
a Overwritten by the n-by-n matrix inv(A).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of the factor U is zero, U is singular, and the inversion could not be completed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine getri interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
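The lwork rules stated for ?getri can be summarized in a small helper. This is only an illustrative sketch in plain C, not part of Intel MKL: the function and type names are hypothetical, the minimal size is taken as max(1, n) so that it agrees with the dimension required of work, and blocksize is a value the caller has to choose.

```c
#include <assert.h>

/* How a given lwork value is interpreted (hypothetical helper type). */
typedef enum { LWORK_QUERY, LWORK_INVALID, LWORK_OK } lwork_status;

/* Classifies lwork for a matrix of order n:
   -1 requests a workspace query, anything else below max(1, n) is invalid. */
lwork_status getri_lwork_status(int n, int lwork)
{
    int minimal = n > 1 ? n : 1;
    if (lwork == -1)
        return LWORK_QUERY;
    if (lwork < minimal)
        return LWORK_INVALID;
    return LWORK_OK;
}

/* Suggested workspace size n*blocksize, never smaller than 1. */
int getri_suggested_lwork(int n, int blocksize)
{
    assert(n >= 0 && blocksize >= 1);
    int lwork = n * blocksize;
    return lwork > 1 ? lwork : 1;
}
```

For n = 100 and an assumed blocksize of 32, the suggested size is 3200 elements, while lwork = 50 would be rejected as smaller than n.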
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed inverse X satisfies the following error bound:
|XA - I| ≤ c(n)ε|X|P|L||U|,
where c(n) is a modest linear function of n; ε is the machine precision; I denotes the identity matrix; P, L, and U are the factors of the matrix factorization A = P*L*U.
The total number of floating-point operations is approximately (4/3)n^3 for real flavors and (16/3)n^3 for complex flavors.

?potri
Computes the inverse of a symmetric (Hermitian) positive-definite matrix.
Syntax
Fortran 77:
call spotri( uplo, n, a, lda, info )
call dpotri( uplo, n, a, lda, info )
call cpotri( uplo, n, a, lda, info )
call zpotri( uplo, n, a, lda, info )
Fortran 95:
call potri( a [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>potri( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the inverse inv(A) of a symmetric positive definite or, for complex flavors, Hermitian positive-definite matrix A. Before calling this routine, call ?potrf to factorize A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether A is upper or lower triangular: If uplo = 'U', then A is upper triangular. If uplo = 'L', then A is lower triangular.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for spotri
DOUBLE PRECISION for dpotri
COMPLEX for cpotri
DOUBLE COMPLEX for zpotri.
Array a(lda,*). Contains the factorization of the matrix A, as returned by ?potrf. The second dimension of a must be at least max(1, n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
Output Parameters
a Overwritten by the n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of the Cholesky factor (and therefore the factor itself) is zero, and the inversion could not be completed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine potri interface are as follows:
a Holds the matrix A of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The computed inverse X satisfies the following error bounds:
||XA - I||_2 ≤ c(n)εκ_2(A), ||AX - I||_2 ≤ c(n)εκ_2(A),
where c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The 2-norm ||A||_2 of a matrix A is defined by ||A||_2 = max_{x·x=1} (Ax·Ax)^(1/2), and the condition number κ_2(A) is defined by κ_2(A) = ||A||_2 ||A^(-1)||_2.
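As a worked illustration of the condition number in the 2-norm (an added example, not taken from the manual): for a symmetric positive-definite matrix the 2-norm equals the largest eigenvalue and the 2-norm of the inverse equals the reciprocal of the smallest eigenvalue, so the condition number is their ratio. For a small 2-by-2 case:

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\lambda_{\max}(A) = 3, \quad \lambda_{\min}(A) = 1,
\]
\[
\|A\|_2 = \lambda_{\max}(A) = 3, \qquad
\|A^{-1}\|_2 = \frac{1}{\lambda_{\min}(A)} = 1, \qquad
\kappa_2(A) = \|A\|_2 \, \|A^{-1}\|_2 = 3 .
\]
```

For this matrix the residual norms of the computed inverse are therefore bounded by 3 times c(n) times the machine precision.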
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.

?pftri
Computes the inverse of a symmetric (Hermitian) positive-definite matrix in RFP format using the Cholesky factorization.
Syntax
Fortran 77:
call spftri( transr, uplo, n, a, info )
call dpftri( transr, uplo, n, a, info )
call cpftri( transr, uplo, n, a, info )
call zpftri( transr, uplo, n, a, info )
C:
lapack_int LAPACKE_<?>pftri( int matrix_order, char transr, char uplo, lapack_int n, <datatype>* a );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the inverse inv(A) of a symmetric positive definite or, for complex data, Hermitian positive-definite matrix A using the Cholesky factorization:
A = U^T*U for real data, A = U^H*U for complex data if uplo = 'U'
A = L*L^T for real data, A = L*L^H for complex data if uplo = 'L'
Before calling this routine, call ?pftrf to factorize A.
The matrix A is in the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1. Must be 'N', 'T' (for real data) or 'C' (for complex data). If transr = 'N', the Normal transr of RFP A is stored. If transr = 'T', the Transpose transr of RFP A is stored. If transr = 'C', the Conjugate-Transpose transr of RFP A is stored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of the RFP matrix A is stored: If uplo = 'U', the array a stores the upper triangular part of the matrix A. If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER.
The order of the matrix A; n ≥ 0.
a REAL for spftri
DOUBLE PRECISION for dpftri
COMPLEX for cpftri
DOUBLE COMPLEX for zpftri.
Array, DIMENSION (n*(n+1)/2). The array a contains the matrix A in the RFP format.
Output Parameters
a The symmetric/Hermitian inverse of the original matrix in the same storage format.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the (i,i) element of the factor U or L is zero, and the inverse could not be computed.

?pptri
Computes the inverse of a packed symmetric (Hermitian) positive-definite matrix.
Syntax
Fortran 77:
call spptri( uplo, n, ap, info )
call dpptri( uplo, n, ap, info )
call cpptri( uplo, n, ap, info )
call zpptri( uplo, n, ap, info )
Fortran 95:
call pptri( ap [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>pptri( int matrix_order, char uplo, lapack_int n, <datatype>* ap );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the inverse inv(A) of a symmetric positive definite or, for complex flavors, Hermitian positive-definite matrix A in packed form. Before calling this routine, call ?pptrf to factorize A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular factor is stored in ap: If uplo = 'U', then the upper triangular factor is stored. If uplo = 'L', then the lower triangular factor is stored.
n INTEGER. The order of the matrix A; n ≥ 0.
ap REAL for spptri
DOUBLE PRECISION for dpptri
COMPLEX for cpptri
DOUBLE COMPLEX for zpptri.
Array, DIMENSION at least max(1, n(n+1)/2).
Contains the factorization of the packed matrix A, as returned by ?pptrf. The dimension of ap must be at least max(1,n(n+1)/2).
Output Parameters
ap Overwritten by the packed n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of the Cholesky factor (and therefore the factor itself) is zero, and the inversion could not be completed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pptri interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The computed inverse X satisfies the following error bounds:
||XA - I||_2 ≤ c(n)εκ_2(A), ||AX - I||_2 ≤ c(n)εκ_2(A),
where c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The 2-norm ||A||_2 of a matrix A is defined by ||A||_2 = max_{x·x=1} (Ax·Ax)^(1/2), and the condition number κ_2(A) is defined by κ_2(A) = ||A||_2 ||A^(-1)||_2.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.

?sytri
Computes the inverse of a symmetric matrix.
Syntax
Fortran 77:
call ssytri( uplo, n, a, lda, ipiv, work, info )
call dsytri( uplo, n, a, lda, ipiv, work, info )
call csytri( uplo, n, a, lda, ipiv, work, info )
call zsytri( uplo, n, a, lda, ipiv, work, info )
Fortran 95:
call sytri( a, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sytri( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the inverse inv(A) of a symmetric matrix A. Before calling this routine, call ?sytrf to factorize A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the Bunch-Kaufman factorization A = P*U*D*U^T*P^T. If uplo = 'L', the array a stores the Bunch-Kaufman factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for ssytri
DOUBLE PRECISION for dsytri
COMPLEX for csytri
DOUBLE COMPLEX for zsytri.
Arrays:
a(lda,*) contains the factorization of the matrix A, as returned by ?sytrf. The second dimension of a must be at least max(1,n).
work(*) is a workspace array. The dimension of work must be at least max(1,2*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sytrf.
Output Parameters
a Overwritten by the n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of D is zero, D is singular, and the inversion could not be completed.
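The three cases of info described above (zero, negative, positive) can be decoded by a small helper after any of the inversion routines returns. The sketch below is plain C and makes no Intel MKL calls; the type and function names are hypothetical and introduced only for illustration.

```c
#include <stddef.h>

/* Outcome of an inversion routine as encoded in info (hypothetical type). */
typedef enum { INV_OK, INV_BAD_ARGUMENT, INV_SINGULAR } inv_status;

/* Decodes info: 0 means success, -i means the i-th parameter was illegal,
   +i means the i-th diagonal element is zero and no inverse was computed.
   If index is not NULL, it receives i (or 0 on success). */
inv_status classify_inversion_info(int info, int *index)
{
    if (index != NULL)
        *index = info < 0 ? -info : info;
    if (info == 0)
        return INV_OK;
    return info < 0 ? INV_BAD_ARGUMENT : INV_SINGULAR;
}
```

A caller would typically pass the value returned in info by the Fortran routine, or the return value of the corresponding LAPACKE function, and report the decoded index in an error message.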
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sytri interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The computed inverse X satisfies the following error bounds:
|D*U^T*P^T*X*P*U - I| ≤ c(n)ε(|D||U^T|P^T|X|P|U| + |D||D^(-1)|)
for uplo = 'U', and
|D*L^T*P^T*X*P*L - I| ≤ c(n)ε(|D||L^T|P^T|X|P|L| + |D||D^(-1)|)
for uplo = 'L'. Here c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.

?hetri
Computes the inverse of a complex Hermitian matrix.
Syntax
Fortran 77:
call chetri( uplo, n, a, lda, ipiv, work, info )
call zhetri( uplo, n, a, lda, ipiv, work, info )
Fortran 95:
call hetri( a, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>hetri( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the inverse inv(A) of a complex Hermitian matrix A. Before calling this routine, call ?hetrf to factorize A.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the Bunch-Kaufman factorization A = P*U*D*U^H*P^T.
If uplo = 'L', the array a stores the Bunch-Kaufman factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work COMPLEX for chetri
DOUBLE COMPLEX for zhetri.
Arrays:
a(lda,*) contains the factorization of the matrix A, as returned by ?hetrf. The second dimension of a must be at least max(1,n).
work(*) is a workspace array. The dimension of work must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hetrf.
Output Parameters
a Overwritten by the n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the i-th diagonal element of D is zero, D is singular, and the inversion could not be completed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetri interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The computed inverse X satisfies the following error bounds:
|D*U^H*P^T*X*P*U - I| ≤ c(n)ε(|D||U^H|P^T|X|P|U| + |D||D^(-1)|)
for uplo = 'U', and
|D*L^H*P^T*X*P*L - I| ≤ c(n)ε(|D||L^H|P^T|X|P|L| + |D||D^(-1)|)
for uplo = 'L'. Here c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.
The real counterpart of this routine is ?sytri.

?sytri2
Computes the inverse of a symmetric indefinite matrix through setting the leading dimension of the workspace and calling ?sytri2x.
Syntax
Fortran 77:
call ssytri2( uplo, n, a, lda, ipiv, work, lwork, info )
call dsytri2( uplo, n, a, lda, ipiv, work, lwork, info )
call csytri2( uplo, n, a, lda, ipiv, work, lwork, info )
call zsytri2( uplo, n, a, lda, ipiv, work, lwork, info )
Fortran 95:
call sytri2( a, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>sytri2( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes the inverse inv(A) of a symmetric indefinite matrix A using the factorization A = U*D*U^T or A = L*D*L^T computed by ?sytrf.
The ?sytri2 routine sets the leading dimension of the workspace before calling ?sytri2x that actually computes the inverse.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates how the input matrix A has been factored: If uplo = 'U', the array a stores the factorization A = U*D*U^T. If uplo = 'L', the array a stores the factorization A = L*D*L^T.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for ssytri2
DOUBLE PRECISION for dsytri2
COMPLEX for csytri2
DOUBLE COMPLEX for zsytri2.
Arrays:
a(lda,*) contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as returned by ?sytrf. The second dimension of a must be at least max(1,n).
work is a workspace array of (n+nb+1)*(nb+3) dimension.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the block structure of D as returned by ?sytrf.
lwork INTEGER. The dimension of the work array.
lwork ≥ (n+nb+1)*(nb+3), where nb is the block size parameter as returned by ?sytrf.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.

Output Parameters
a If info = 0, the symmetric inverse of the original matrix.
If uplo = 'U', the upper triangular part of the inverse is formed and the part of A below the diagonal is not referenced.
If uplo = 'L', the lower triangular part of the inverse is formed and the part of A above the diagonal is not referenced.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, D(i,i) = 0; D is singular and its inverse could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sytri2 interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
uplo Indicates how the matrix A has been factored. Must be 'U' or 'L'.

See Also
?sytrf
?sytri2x

?hetri2
Computes the inverse of a Hermitian indefinite matrix by setting the leading dimension of the workspace and calling ?hetri2x.
Syntax
Fortran 77:
call chetri2( uplo, n, a, lda, ipiv, work, lwork, info )
call zhetri2( uplo, n, a, lda, ipiv, work, lwork, info )
Fortran 95:
call hetri2( a,ipiv[,uplo][,info] )
C:
lapack_int LAPACKE_hetri2( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a Hermitian indefinite matrix A using the factorization A = U*D*U^H or A = L*D*L^H computed by ?hetrf.
The ?hetri2 routine sets the leading dimension of the workspace and then calls ?hetri2x, which actually computes the inverse.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array a stores the factorization A = U*D*U^H.
If uplo = 'L', the array a stores the factorization A = L*D*L^H.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work COMPLEX for chetri2
DOUBLE COMPLEX for zhetri2
Arrays:
a(lda,*) contains the block diagonal matrix D and the multipliers used to obtain the factor U or L as returned by ?hetrf. The second dimension of a must be at least max(1,n).
work is a workspace array of dimension (n+nb+1)*(nb+3).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the block structure of D as returned by ?hetrf.
lwork INTEGER. The dimension of the work array.
lwork ≥ (n+nb+1)*(nb+3), where nb is the block size parameter as returned by ?hetrf.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.

Output Parameters
a If info = 0, the inverse of the original matrix.
If uplo = 'U', the upper triangular part of the inverse is formed and the part of A below the diagonal is not referenced.
If uplo = 'L', the lower triangular part of the inverse is formed and the part of A above the diagonal is not referenced.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, D(i,i) = 0; D is singular and its inverse could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetri2 interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
uplo Indicates how the input matrix A has been factored. Must be 'U' or 'L'.

See Also
?hetrf
?hetri2x

?sytri2x
Computes the inverse of a symmetric indefinite matrix after ?sytri2 sets the leading dimension of the workspace.
Syntax
Fortran 77:
call ssytri2x( uplo, n, a, lda, ipiv, work, nb, info )
call dsytri2x( uplo, n, a, lda, ipiv, work, nb, info )
call csytri2x( uplo, n, a, lda, ipiv, work, nb, info )
call zsytri2x( uplo, n, a, lda, ipiv, work, nb, info )
Fortran 95:
call sytri2x( a,ipiv,nb[,uplo][,info] )
C:
lapack_int LAPACKE_sytri2x( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv, lapack_int nb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a symmetric indefinite matrix A using the factorization A = U*D*U^T or A = L*D*L^T computed by ?sytrf.
The ?sytri2x routine performs the actual inversion; the ?sytri2 routine sets the leading dimension of the workspace and then calls ?sytri2x.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array a stores the factorization A = U*D*U^T.
If uplo = 'L', the array a stores the factorization A = L*D*L^T.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for ssytri2x
DOUBLE PRECISION for dsytri2x
COMPLEX for csytri2x
DOUBLE COMPLEX for zsytri2x
Arrays:
a(lda,*) contains the nb (block size) diagonal matrix D and the multipliers used to obtain the factor U or L as returned by ?sytrf. The second dimension of a must be at least max(1,n).
work is a workspace array of dimension (n+nb+1)*(nb+3), where nb is the block size as set by ?sytrf.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the nb structure of D as returned by ?sytrf.
nb INTEGER.
Block size.

Output Parameters
a If info = 0, the symmetric inverse of the original matrix.
If uplo = 'U', the upper triangular part of the inverse is formed and the part of A below the diagonal is not referenced.
If uplo = 'L', the lower triangular part of the inverse is formed and the part of A above the diagonal is not referenced.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, D(i,i) = 0; D is singular and its inverse could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sytri2x interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
nb Holds the block size.
uplo Indicates how the input matrix A has been factored. Must be 'U' or 'L'.

See Also
?sytrf
?sytri2

?hetri2x
Computes the inverse of a Hermitian indefinite matrix after ?hetri2 sets the leading dimension of the workspace.

Syntax
Fortran 77:
call chetri2x( uplo, n, a, lda, ipiv, work, nb, info )
call zhetri2x( uplo, n, a, lda, ipiv, work, nb, info )
Fortran 95:
call hetri2x( a,ipiv,nb[,uplo][,info] )
C:
lapack_int LAPACKE_hetri2x( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const lapack_int* ipiv, lapack_int nb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a Hermitian indefinite matrix A using the factorization A = U*D*U^H or A = L*D*L^H computed by ?hetrf.
The ?hetri2x routine performs the actual inversion; the ?hetri2 routine sets the leading dimension of the workspace and then calls ?hetri2x.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array a stores the factorization A = U*D*U^H.
If uplo = 'L', the array a stores the factorization A = L*D*L^H.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work COMPLEX for chetri2x
DOUBLE COMPLEX for zhetri2x
Arrays:
a(lda,*) contains the nb (block size) diagonal matrix D and the multipliers used to obtain the factor U or L as returned by ?hetrf. The second dimension of a must be at least max(1,n).
work is a workspace array of dimension (n+nb+1)*(nb+3), where nb is the block size as set by ?hetrf.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). Details of the interchanges and the nb structure of D as returned by ?hetrf.
nb INTEGER. Block size.

Output Parameters
a If info = 0, the inverse of the original matrix.
If uplo = 'U', the upper triangular part of the inverse is formed and the part of A below the diagonal is not referenced.
If uplo = 'L', the lower triangular part of the inverse is formed and the part of A above the diagonal is not referenced.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, D(i,i) = 0; D is singular and its inverse could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetri2x interface are as follows:
a Holds the matrix A of size (n,n).
ipiv Holds the vector of length n.
nb Holds the block size.
uplo Indicates how the input matrix A has been factored. Must be 'U' or 'L'.

See Also
?hetrf
?hetri2

?sptri
Computes the inverse of a symmetric matrix using packed storage.

Syntax
Fortran 77:
call ssptri( uplo, n, ap, ipiv, work, info )
call dsptri( uplo, n, ap, ipiv, work, info )
call csptri( uplo, n, ap, ipiv, work, info )
call zsptri( uplo, n, ap, ipiv, work, info )
Fortran 95:
call sptri( ap, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_sptri( int matrix_order, char uplo, lapack_int n, <datatype>* ap, const lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a packed symmetric matrix A. Before calling this routine, call ?sptrf to factorize A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array ap stores the Bunch-Kaufman factorization A = P*U*D*U^T*P^T.
If uplo = 'L', the array ap stores the Bunch-Kaufman factorization A = P*L*D*L^T*P^T.
n INTEGER. The order of the matrix A; n ≥ 0.
ap, work REAL for ssptri
DOUBLE PRECISION for dsptri
COMPLEX for csptri
DOUBLE COMPLEX for zsptri.
Arrays:
ap(*) contains the factorization of the matrix A, as returned by ?sptrf. The dimension of ap must be at least max(1,n(n+1)/2).
work(*) is a workspace array. The dimension of work must be at least max(1,n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?sptrf.

Output Parameters
ap Overwritten by the n-by-n matrix inv(A) in packed form.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of D is zero, D is singular, and the inversion could not be completed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sptri interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed inverse X satisfies the following error bounds:
|D*U^T*P^T*X*P*U - I| ≤ c(n)ε(|D||U^T|P^T|X|P|U| + |D||D^-1|)
for uplo = 'U', and
|D*L^T*P^T*X*P*L - I| ≤ c(n)ε(|D||L^T|P^T|X|P|L| + |D||D^-1|)
for uplo = 'L'. Here c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.

?hptri
Computes the inverse of a complex Hermitian matrix using packed storage.

Syntax
Fortran 77:
call chptri( uplo, n, ap, ipiv, work, info )
call zhptri( uplo, n, ap, ipiv, work, info )
Fortran 95:
call hptri( ap, ipiv [,uplo] [,info] )
C:
lapack_int LAPACKE_hptri( int matrix_order, char uplo, lapack_int n, <datatype>* ap, const lapack_int* ipiv );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a complex Hermitian matrix A using packed storage. Before calling this routine, call ?hptrf to factorize A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates how the input matrix A has been factored:
If uplo = 'U', the array ap stores the packed Bunch-Kaufman factorization A = P*U*D*U^H*P^T.
If uplo = 'L', the array ap stores the packed Bunch-Kaufman factorization A = P*L*D*L^H*P^T.
n INTEGER. The order of the matrix A; n ≥ 0.
ap, work COMPLEX for chptri
DOUBLE COMPLEX for zhptri.
Arrays:
ap(*) contains the factorization of the matrix A, as returned by ?hptrf. The dimension of ap must be at least max(1,n(n+1)/2).
work(*) is a workspace array. The dimension of work must be at least max(1,n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The ipiv array, as returned by ?hptrf.

Output Parameters
ap Overwritten by the n-by-n matrix inv(A) in packed form.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of D is zero, D is singular, and the inversion could not be completed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hptri interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
The computed inverse X satisfies the following error bounds:
|D*U^H*P^T*X*P*U - I| ≤ c(n)ε(|D||U^H|P^T|X|P|U| + |D||D^-1|)
for uplo = 'U', and
|D*L^H*P^T*X*P*L - I| ≤ c(n)ε(|D||L^H|P^T|X|P|L| + |D||D^-1|)
for uplo = 'L'. Here c(n) is a modest linear function of n, and ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately (2/3)n^3 for real flavors and (8/3)n^3 for complex flavors.
The real counterpart of this routine is ?sptri.

?trtri
Computes the inverse of a triangular matrix.

Syntax
Fortran 77:
call strtri( uplo, diag, n, a, lda, info )
call dtrtri( uplo, diag, n, a, lda, info )
call ctrtri( uplo, diag, n, a, lda, info )
call ztrtri( uplo, diag, n, a, lda, info )
Fortran 95:
call trtri( a [,uplo] [,diag] [,info] )
C:
lapack_int LAPACKE_trtri( int matrix_order, char uplo, char diag, lapack_int n, <datatype>* a, lapack_int lda );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a triangular matrix A.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array a.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for strtri
DOUBLE PRECISION for dtrtri
COMPLEX for ctrtri
DOUBLE COMPLEX for ztrtri.
Array: DIMENSION (lda,*). Contains the matrix A. The second dimension of a must be at least max(1,n).
lda INTEGER. The first dimension of a; lda ≥ max(1, n).

Output Parameters
a Overwritten by the n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is zero, A is singular, and the inversion could not be completed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine trtri interface are as follows:
a Holds the matrix A of size (n,n).
uplo Must be 'U' or 'L'. The default value is 'U'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
The computed inverse X satisfies the following error bounds:
|XA - I| ≤ c(n)ε|X||A|
|X - A^-1| ≤ c(n)ε|A^-1||A||X|,
where c(n) is a modest linear function of n; ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately (1/3)n^3 for real flavors and (4/3)n^3 for complex flavors.

?tftri
Computes the inverse of a triangular matrix stored in the Rectangular Full Packed (RFP) format.

Syntax
Fortran 77:
call stftri( transr, uplo, diag, n, a, info )
call dtftri( transr, uplo, diag, n, a, info )
call ctftri( transr, uplo, diag, n, a, info )
call ztftri( transr, uplo, diag, n, a, info )
C:
lapack_int LAPACKE_tftri( int matrix_order, char transr, char uplo, char diag, lapack_int n, <datatype>* a );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
Computes the inverse of a triangular matrix A stored in the Rectangular Full Packed (RFP) format. For the description of the RFP format, see Matrix Storage Schemes.
This is the block version of the algorithm, calling Level 3 BLAS.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
transr CHARACTER*1. Must be 'N', 'T' (for real data) or 'C' (for complex data).
If transr = 'N', the normal form of RFP A is stored.
If transr = 'T', the transpose form of RFP A is stored.
If transr = 'C', the conjugate-transpose form of RFP A is stored.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of RFP A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array a.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for stftri
DOUBLE PRECISION for dtftri
COMPLEX for ctftri
DOUBLE COMPLEX for ztftri.
Array, DIMENSION (n*(n+1)/2). The array a contains the matrix A in the RFP format.

Output Parameters
a The (triangular) inverse of the original matrix in the same storage format.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, A(i,i) is exactly zero. The triangular matrix is singular and its inverse cannot be computed.

?tptri
Computes the inverse of a triangular matrix using packed storage.

Syntax
Fortran 77:
call stptri( uplo, diag, n, ap, info )
call dtptri( uplo, diag, n, ap, info )
call ctptri( uplo, diag, n, ap, info )
call ztptri( uplo, diag, n, ap, info )
Fortran 95:
call tptri( ap [,uplo] [,diag] [,info] )
C:
lapack_int LAPACKE_tptri( int matrix_order, char uplo, char diag, lapack_int n, <datatype>* ap );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes the inverse inv(A) of a packed triangular matrix A.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether A is upper or lower triangular:
If uplo = 'U', then A is upper triangular.
If uplo = 'L', then A is lower triangular.
diag CHARACTER*1. Must be 'N' or 'U'.
If diag = 'N', then A is not a unit triangular matrix.
If diag = 'U', A is unit triangular: diagonal elements of A are assumed to be 1 and not referenced in the array ap.
n INTEGER. The order of the matrix A; n ≥ 0.
ap REAL for stptri
DOUBLE PRECISION for dtptri
COMPLEX for ctptri
DOUBLE COMPLEX for ztptri.
Array, DIMENSION at least max(1,n(n+1)/2). Contains the packed triangular matrix A.

Output Parameters
ap Overwritten by the packed n-by-n matrix inv(A).
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is zero, A is singular, and the inversion could not be completed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tptri interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
uplo Must be 'U' or 'L'. The default value is 'U'.
diag Must be 'N' or 'U'. The default value is 'N'.

Application Notes
The computed inverse X satisfies the following error bounds:
|XA - I| ≤ c(n)ε|X||A|
|X - A^-1| ≤ c(n)ε|A^-1||A||X|,
where c(n) is a modest linear function of n; ε is the machine precision; I denotes the identity matrix.
The total number of floating-point operations is approximately (1/3)n^3 for real flavors and (4/3)n^3 for complex flavors.
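
To make the packed layout and the info convention of ?tptri concrete, the following is a minimal C sketch, not the MKL implementation. It assumes the usual column-wise packing for uplo = 'U' and a non-unit diagonal, and it uses plain back substitution rather than the blocked algorithm a library would use. The names packed_upper_index and packed_upper_inverse are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* 0-based position of element (i,j), i <= j, of an upper triangular
   matrix packed by columns (assumed layout for uplo = 'U'). */
static size_t packed_upper_index(size_t i, size_t j)
{
    assert(i <= j);
    return i + j * (j + 1) / 2;
}

/* Hypothetical helper, not an MKL routine: writes the inverse of the
   packed upper triangular matrix ap (non-unit diagonal) into xp.
   Both arrays hold n*(n+1)/2 elements.
   Returns 0 on success, or k > 0 if the k-th diagonal element
   (1-based) is zero, mirroring the info convention described above. */
static int packed_upper_inverse(size_t n, const double *ap, double *xp)
{
    /* A zero on the diagonal means the matrix is singular. */
    for (size_t k = 0; k < n; ++k)
        if (ap[packed_upper_index(k, k)] == 0.0)
            return (int)(k + 1);

    for (size_t j = 0; j < n; ++j) {
        xp[packed_upper_index(j, j)] = 1.0 / ap[packed_upper_index(j, j)];
        /* Back substitution up column j: solve A * x = e_j. */
        for (size_t i = j; i-- > 0; ) {
            double sum = 0.0;
            for (size_t k = i + 1; k <= j; ++k)
                sum += ap[packed_upper_index(i, k)] *
                       xp[packed_upper_index(k, j)];
            xp[packed_upper_index(i, j)] = -sum / ap[packed_upper_index(i, i)];
        }
    }
    return 0;
}
```

For the 2-by-2 matrix with rows (2, 1) and (0, 4), the packed input is {2, 1, 4} and the sketch produces the packed inverse {0.5, -0.125, 0.25}. Unlike ?tptri, the sketch writes to a separate output array instead of overwriting ap.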
Routines for Matrix Equilibration
Routines described in this section are used to compute scaling factors needed to equilibrate a matrix. Note that these routines do not actually scale the matrices.

?geequ
Computes row and column scaling factors intended to equilibrate a general matrix and reduce its condition number.

Syntax
Fortran 77:
call sgeequ( m, n, a, lda, r, c, rowcnd, colcnd, amax, info )
call dgeequ( m, n, a, lda, r, c, rowcnd, colcnd, amax, info )
call cgeequ( m, n, a, lda, r, c, rowcnd, colcnd, amax, info )
call zgeequ( m, n, a, lda, r, c, rowcnd, colcnd, amax, info )
Fortran 95:
call geequ( a, r, c [,rowcnd] [,colcnd] [,amax] [,info] )
C:
lapack_int LAPACKE_sgeequ( int matrix_order, lapack_int m, lapack_int n, const float* a, lapack_int lda, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_dgeequ( int matrix_order, lapack_int m, lapack_int n, const double* a, lapack_int lda, double* r, double* c, double* rowcnd, double* colcnd, double* amax );
lapack_int LAPACKE_cgeequ( int matrix_order, lapack_int m, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_zgeequ( int matrix_order, lapack_int m, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* r, double* c, double* rowcnd, double* colcnd, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate an m-by-n matrix A and reduce its condition number. The output array r returns the row scale factors and the array c returns the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b(i,j) = r(i)*a(i,j)*c(j) have absolute value 1.
See the ?laqge auxiliary function, which uses the scaling factors computed by ?geequ.
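
The scaling described above can be sketched in plain C as follows. This is an illustration under stated assumptions, not the MKL routine: the matrix is real, stored by rows, with m > 0 and n > 0, and the factors are not clamped to the SMLNUM and BIGNUM safe range. The name equilibrate_sketch and its argument layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static double abs_value(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical helper, not an MKL routine. a is m-by-n, stored by rows.
   On success returns 0 with r[i] = 1 / max_j |a(i,j)| and
   c[j] = 1 / max_i (r[i] * |a(i,j)|).
   Returns i (1-based) if row i is exactly zero, or m + j if column j
   is exactly zero, following the info convention of the routine. */
static int equilibrate_sketch(size_t m, size_t n, const double *a,
                              double *r, double *c,
                              double *rowcnd, double *colcnd, double *amax)
{
    assert(m > 0 && n > 0);

    /* r[i] first holds the largest magnitude in row i. */
    double rmin = 0.0, rmax = 0.0;
    for (size_t i = 0; i < m; ++i) {
        double big = 0.0;
        for (size_t j = 0; j < n; ++j) {
            double v = abs_value(a[i * n + j]);
            if (v > big) big = v;
        }
        r[i] = big;
        if (i == 0 || big < rmin) rmin = big;
        if (big > rmax) rmax = big;
    }
    *amax = rmax;
    if (rmin == 0.0)
        for (size_t i = 0; i < m; ++i)
            if (r[i] == 0.0) return (int)(i + 1);
    for (size_t i = 0; i < m; ++i) r[i] = 1.0 / r[i];
    /* Smallest r over largest r equals rmin / rmax after inversion. */
    *rowcnd = rmin / rmax;

    /* c[j] first holds the largest row-scaled magnitude in column j. */
    double cmin = 0.0, cmax = 0.0;
    for (size_t j = 0; j < n; ++j) {
        double big = 0.0;
        for (size_t i = 0; i < m; ++i) {
            double v = r[i] * abs_value(a[i * n + j]);
            if (v > big) big = v;
        }
        c[j] = big;
        if (j == 0 || big < cmin) cmin = big;
        if (big > cmax) cmax = big;
    }
    if (cmin == 0.0)
        for (size_t j = 0; j < n; ++j)
            if (c[j] == 0.0) return (int)(m + j + 1);
    for (size_t j = 0; j < n; ++j) c[j] = 1.0 / c[j];
    *colcnd = cmin / cmax;
    return 0;
}
```

For the matrix with rows (1, 2) and (4, 8), the sketch yields r = (0.5, 0.125), c = (2, 1), rowcnd = 0.25, colcnd = 0.5 and amax = 8, so every entry of the scaled matrix has absolute value 1. As with the routine itself, the sketch only computes the factors and leaves the matrix unchanged.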
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A; m ≥ 0.
n INTEGER. The number of columns of the matrix A; n ≥ 0.
a REAL for sgeequ
DOUBLE PRECISION for dgeequ
COMPLEX for cgeequ
DOUBLE COMPLEX for zgeequ.
Array: DIMENSION (lda,*). Contains the m-by-n matrix A whose equilibration factors are to be computed. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, m).

Output Parameters
r, c REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Arrays: r(m), c(n).
If info = 0 or info > m, the array r contains the row scale factors of the matrix A.
If info = 0, the array c contains the column scale factors of the matrix A.
rowcnd REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
If info = 0 or info > m, rowcnd contains the ratio of the smallest r(i) to the largest r(i).
colcnd REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
If info = 0, colcnd contains the ratio of the smallest c(i) to the largest c(i).
amax REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i and
i ≤ m, the i-th row of A is exactly zero;
i > m, the (i-m)-th column of A is exactly zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geequ interface are as follows:
a Holds the matrix A of size (m, n).
r Holds the vector of length m.
c Holds the vector of length n.

Application Notes
All the components of r and c are restricted to be between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of A but works well in practice.
SMLNUM and BIGNUM are parameters representing machine precision. You can use the ?lamch routines to compute them. For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows:
SMLNUM = slamch ('s')
BIGNUM = 1 / SMLNUM
If rowcnd ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by r.
If colcnd ≥ 0.1, it is not worth scaling by c.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.

?geequb
Computes row and column scaling factors restricted to a power of radix to equilibrate a general matrix and reduce its condition number.
Syntax
Fortran 77:
call sgeequb( m, n, a, lda, r, c, rowcnd, colcnd, amax, info )
call dgeequb( m, n, a, lda, r, c, rowcnd, colcnd, amax, info )
call cgeequb( m, n, a, lda, r, c, rowcnd, colcnd, amax, info )
call zgeequb( m, n, a, lda, r, c, rowcnd, colcnd, amax, info )
C:
lapack_int LAPACKE_sgeequb( int matrix_order, lapack_int m, lapack_int n, const float* a, lapack_int lda, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_dgeequb( int matrix_order, lapack_int m, lapack_int n, const double* a, lapack_int lda, double* r, double* c, double* rowcnd, double* colcnd, double* amax );
lapack_int LAPACKE_cgeequb( int matrix_order, lapack_int m, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_zgeequb( int matrix_order, lapack_int m, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* r, double* c, double* rowcnd, double* colcnd, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate an m-by-n general matrix A and reduce its condition number. The output array r returns the row scale factors and the array c returns the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b(i,j) = r(i)*a(i,j)*c(j) have an absolute value of at most the radix.
r(i) and c(j) are restricted to be a power of the radix between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of A but works well in practice.
SMLNUM and BIGNUM are parameters representing machine precision. You can use the ?lamch routines to compute them.
For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows:
SMLNUM = slamch ('s')
BIGNUM = 1 / SMLNUM
This routine differs from ?geequ by restricting the scaling factors to a power of the radix. Except for over- and underflow, scaling by these factors introduces no additional rounding errors. However, the scaled entries' magnitudes are no longer equal to approximately 1 but lie between sqrt(radix) and 1/sqrt(radix).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A; m ≥ 0.
n INTEGER. The number of columns of the matrix A; n ≥ 0.
a REAL for sgeequb
DOUBLE PRECISION for dgeequb
COMPLEX for cgeequb
DOUBLE COMPLEX for zgeequb.
Array: DIMENSION (lda,*). Contains the m-by-n matrix A whose equilibration factors are to be computed. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1, m).

Output Parameters
r, c REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Arrays: r(m), c(n).
If info = 0 or info > m, the array r contains the row scale factors for the matrix A.
If info = 0, the array c contains the column scale factors for the matrix A.
rowcnd REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
If info = 0 or info > m, rowcnd contains the ratio of the smallest r(i) to the largest r(i). If rowcnd ≥ 0.1, and amax is neither too large nor too small, it is not worth scaling by r.
colcnd REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
If info = 0, colcnd contains the ratio of the smallest c(i) to the largest c(i). If colcnd ≥ 0.1, it is not worth scaling by c.
amax REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or very close to underflow, the matrix should be scaled.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i and
i ≤ m, the i-th row of A is exactly zero;
i > m, the (i-m)-th column of A is exactly zero.

?gbequ
Computes row and column scaling factors intended to equilibrate a banded matrix and reduce its condition number.

Syntax

Fortran 77:
call sgbequ( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call dgbequ( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call cgbequ( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call zgbequ( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )

Fortran 95:
call gbequ( ab, r, c [,kl] [,rowcnd] [,colcnd] [,amax] [,info] )

C:
lapack_int LAPACKE_sgbequ( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const float* ab, lapack_int ldab, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_dgbequ( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const double* ab, lapack_int ldab, double* r, double* c, double* rowcnd, double* colcnd, double* amax );
lapack_int LAPACKE_cgbequ( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_float* ab, lapack_int ldab, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_zgbequ( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_double* ab, lapack_int ldab, double* r, double* c, double* rowcnd, double* colcnd, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and
column scalings intended to equilibrate an m-by-n band matrix A and reduce its condition number. The output array r returns the row scale factors and the array c returns the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b(i,j) = r(i)*a(i,j)*c(j) have absolute value 1.
See the ?laqgb auxiliary function, which uses the scaling factors computed by ?gbequ.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A; m ≥ 0.
n INTEGER. The number of columns of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab REAL for sgbequ, DOUBLE PRECISION for dgbequ, COMPLEX for cgbequ, DOUBLE COMPLEX for zgbequ.
Array, DIMENSION (ldab,*). Contains the original band matrix A stored in rows from 1 to kl + ku + 1. The second dimension of ab must be at least max(1,n).
ldab INTEGER. The leading dimension of ab; ldab ≥ kl+ku+1.

Output Parameters
r, c REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Arrays: r(m), c(n). If info = 0 or info > m, the array r contains the row scale factors of the matrix A. If info = 0, the array c contains the column scale factors of the matrix A.
rowcnd REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0 or info > m, rowcnd contains the ratio of the smallest r(i) to the largest r(i).
colcnd REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0, colcnd contains the ratio of the smallest c(i) to the largest c(i).
amax REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i and
i ≤ m, the i-th row of A is exactly zero;
i > m, the (i-m)-th column of A is exactly zero.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbequ interface are as follows:
ab Holds the array A of size (kl+ku+1,n).
r Holds the vector of length m.
c Holds the vector of length n.
kl If omitted, assumed kl = ku.
ku Restored as ku = ldab-kl-1, where ldab is the leading dimension of ab.

Application Notes
All the components of r and c are restricted to be between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of A but works well in practice.
SMLNUM and BIGNUM are parameters representing machine precision. You can use the ?lamch routines to compute them. For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows:
SMLNUM = slamch ('s')
BIGNUM = 1 / SMLNUM
If rowcnd ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by r.
If colcnd ≥ 0.1, it is not worth scaling by c.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.

?gbequb
Computes row and column scaling factors restricted to a power of radix to equilibrate a banded matrix and reduce its condition number.
Syntax

Fortran 77:
call sgbequb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call dgbequb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call cgbequb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )
call zgbequb( m, n, kl, ku, ab, ldab, r, c, rowcnd, colcnd, amax, info )

C:
lapack_int LAPACKE_sgbequb( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const float* ab, lapack_int ldab, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_dgbequb( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const double* ab, lapack_int ldab, double* r, double* c, double* rowcnd, double* colcnd, double* amax );
lapack_int LAPACKE_cgbequb( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_float* ab, lapack_int ldab, float* r, float* c, float* rowcnd, float* colcnd, float* amax );
lapack_int LAPACKE_zgbequb( int matrix_order, lapack_int m, lapack_int n, lapack_int kl, lapack_int ku, const lapack_complex_double* ab, lapack_int ldab, double* r, double* c, double* rowcnd, double* colcnd, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate an m-by-n banded matrix A and reduce its condition number. The output array r returns the row scale factors and the array c returns the column scale factors. These factors are chosen to try to make the largest element in each row and column of the matrix B with elements b(i,j) = r(i)*a(i,j)*c(j) have an absolute value of at most the radix.
r(i) and c(j) are restricted to be a power of the radix between SMLNUM = smallest safe number and BIGNUM = largest safe number. Use of these scaling factors is not guaranteed to reduce the condition number of A but works well in practice.
SMLNUM and BIGNUM are parameters representing machine precision.
You can use the ?lamch routines to compute them. For example, compute single precision (real and complex) values of SMLNUM and BIGNUM as follows:
SMLNUM = slamch ('s')
BIGNUM = 1 / SMLNUM
This routine differs from ?gbequ by restricting the scaling factors to a power of the radix. Except for over- and underflow, scaling by these factors introduces no additional rounding errors. However, the scaled entries' magnitudes are no longer equal to approximately 1 but lie between sqrt(radix) and 1/sqrt(radix).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows of the matrix A; m ≥ 0.
n INTEGER. The number of columns of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
ab REAL for sgbequb, DOUBLE PRECISION for dgbequb, COMPLEX for cgbequb, DOUBLE COMPLEX for zgbequb.
Array: DIMENSION (ldab,*). Contains the original banded matrix A stored in rows from 1 to kl + ku + 1. The j-th column of A is stored in the j-th column of the array ab as follows:
ab(ku+1+i-j,j) = a(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl).
The second dimension of ab must be at least max(1,n).
ldab INTEGER. The leading dimension of ab; ldab ≥ kl+ku+1.

Output Parameters
r, c REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Arrays: r(m), c(n). If info = 0 or info > m, the array r contains the row scale factors for the matrix A. If info = 0, the array c contains the column scale factors for the matrix A.
rowcnd REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0 or info > m, rowcnd contains the ratio of the smallest r(i) to the largest r(i).
If rowcnd ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by r.
colcnd REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0, colcnd contains the ratio of the smallest c(i) to the largest c(i). If colcnd ≥ 0.1, it is not worth scaling by c.
amax REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or underflow, the matrix should be scaled.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i and
i ≤ m, the i-th row of A is exactly zero;
i > m, the (i-m)-th column of A is exactly zero.

?poequ
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive definite matrix and reduce its condition number.

Syntax

Fortran 77:
call spoequ( n, a, lda, s, scond, amax, info )
call dpoequ( n, a, lda, s, scond, amax, info )
call cpoequ( n, a, lda, s, scond, amax, info )
call zpoequ( n, a, lda, s, scond, amax, info )

Fortran 95:
call poequ( a, s [,scond] [,amax] [,info] )

C:
lapack_int LAPACKE_spoequ( int matrix_order, lapack_int n, const float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_dpoequ( int matrix_order, lapack_int n, const double* a, lapack_int lda, double* s, double* scond, double* amax );
lapack_int LAPACKE_cpoequ( int matrix_order, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_zpoequ( int matrix_order, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* s, double* scond, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate a
symmetric (Hermitian) positive-definite matrix A and reduce its condition number (with respect to the two-norm). The output array s returns scale factors computed as
s(i) = 1/sqrt(a(i,i)).
These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has diagonal elements equal to 1.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
See the ?laqsy auxiliary function, which uses the scaling factors computed by ?poequ.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for spoequ, DOUBLE PRECISION for dpoequ, COMPLEX for cpoequ, DOUBLE COMPLEX for zpoequ.
Array: DIMENSION (lda,*). Contains the n-by-n symmetric or Hermitian positive definite matrix A whose scaling factors are to be computed. Only the diagonal elements of A are referenced. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).

Output Parameters
s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i).
amax REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine poequ interface are as follows:
a Holds the matrix A of size (n,n).
s Holds the vector of length n.

Application Notes
If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.

?poequb
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive definite matrix and reduce its condition number.

Syntax

Fortran 77:
call spoequb( n, a, lda, s, scond, amax, info )
call dpoequb( n, a, lda, s, scond, amax, info )
call cpoequb( n, a, lda, s, scond, amax, info )
call zpoequb( n, a, lda, s, scond, amax, info )

C:
lapack_int LAPACKE_spoequb( int matrix_order, lapack_int n, const float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_dpoequb( int matrix_order, lapack_int n, const double* a, lapack_int lda, double* s, double* scond, double* amax );
lapack_int LAPACKE_cpoequb( int matrix_order, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_zpoequb( int matrix_order, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* s, double* scond, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate a symmetric (Hermitian) positive-definite matrix A and reduce its condition number (with respect to the two-norm).
These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has diagonal elements equal to 1.
s(i) is a power of two nearest to, but not exceeding, 1/sqrt(A(i,i)).
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of the matrix A; n ≥ 0.
a REAL for spoequb, DOUBLE PRECISION for dpoequb, COMPLEX for cpoequb, DOUBLE COMPLEX for zpoequb.
Array: DIMENSION (lda,*). Contains the n-by-n symmetric or Hermitian positive definite matrix A whose scaling factors are to be computed. Only the diagonal elements of A are referenced. The second dimension of a must be at least max(1,n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).

Output Parameters
s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i). If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
amax REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or underflow, the matrix should be scaled.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

?ppequ
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive definite matrix in packed storage and reduce its condition number.
Syntax

Fortran 77:
call sppequ( uplo, n, ap, s, scond, amax, info )
call dppequ( uplo, n, ap, s, scond, amax, info )
call cppequ( uplo, n, ap, s, scond, amax, info )
call zppequ( uplo, n, ap, s, scond, amax, info )

Fortran 95:
call ppequ( ap, s [,scond] [,amax] [,uplo] [,info] )

C:
lapack_int LAPACKE_sppequ( int matrix_order, char uplo, lapack_int n, const float* ap, float* s, float* scond, float* amax );
lapack_int LAPACKE_dppequ( int matrix_order, char uplo, lapack_int n, const double* ap, double* s, double* scond, double* amax );
lapack_int LAPACKE_cppequ( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* ap, float* s, float* scond, float* amax );
lapack_int LAPACKE_zppequ( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* ap, double* s, double* scond, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate a symmetric (Hermitian) positive definite matrix A in packed storage and reduce its condition number (with respect to the two-norm). The output array s returns scale factors computed as
s(i) = 1/sqrt(a(i,i)).
These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has diagonal elements equal to 1.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
See the ?laqsp auxiliary function, which uses the scaling factors computed by ?ppequ.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is packed in the array ap:
If uplo = 'U', the array ap stores the upper triangular part of the matrix A.
If uplo = 'L', the array ap stores the lower triangular part of the matrix A.
n INTEGER. The order of matrix A; n ≥ 0.
ap REAL for sppequ, DOUBLE PRECISION for dppequ, COMPLEX for cppequ, DOUBLE COMPLEX for zppequ.
Array, DIMENSION at least max(1,n(n+1)/2). The array ap contains the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes).

Output Parameters
s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i).
amax REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ppequ interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
s Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.
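
The rule of thumb in these application notes can be turned into a small decision helper. The C sketch below is only an illustration under stated assumptions: the function name worth_scaling is hypothetical, the threshold of 0.1 is read as "scond of at least 0.1", and the bounds for "too small" and "too large" are assumed to be DBL_MIN/DBL_EPSILON and its reciprocal, which mirrors the safe-minimum-over-precision choice made in the reference LAPACK ?laq* auxiliary routines. MKL may use different internal bounds.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper, not part of MKL: returns 1 if scaling by s looks
 * worthwhile, 0 otherwise, given scond and amax from an equilibration
 * routine such as ?ppequ. The bounds below are assumptions. */
int worth_scaling(double scond, double amax)
{
    assert(scond >= 0.0 && amax >= 0.0);

    const double thresh = 0.1;                   /* scond threshold from the notes */
    const double small = DBL_MIN / DBL_EPSILON;  /* assumed "too small" bound      */
    const double large = 1.0 / small;            /* assumed "too large" bound      */

    /* Scaling is skipped only when the factors are well balanced AND the
     * largest entry is comfortably inside the representable range. */
    int balanced = (scond >= thresh);
    int in_range = (amax >= small && amax <= large);
    return !(balanced && in_range);
}
```

A caller would typically run the equilibration routine first, check that info = 0, and apply the scale factors only when the helper returns 1.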

?pbequ
Computes row and column scaling factors intended to equilibrate a symmetric (Hermitian) positive-definite band matrix and reduce its condition number.

Syntax

Fortran 77:
call spbequ( uplo, n, kd, ab, ldab, s, scond, amax, info )
call dpbequ( uplo, n, kd, ab, ldab, s, scond, amax, info )
call cpbequ( uplo, n, kd, ab, ldab, s, scond, amax, info )
call zpbequ( uplo, n, kd, ab, ldab, s, scond, amax, info )

Fortran 95:
call pbequ( ab, s [,scond] [,amax] [,uplo] [,info] )

C:
lapack_int LAPACKE_spbequ( int matrix_order, char uplo, lapack_int n, lapack_int kd, const float* ab, lapack_int ldab, float* s, float* scond, float* amax );
lapack_int LAPACKE_dpbequ( int matrix_order, char uplo, lapack_int n, lapack_int kd, const double* ab, lapack_int ldab, double* s, double* scond, double* amax );
lapack_int LAPACKE_cpbequ( int matrix_order, char uplo, lapack_int n, lapack_int kd, const lapack_complex_float* ab, lapack_int ldab, float* s, float* scond, float* amax );
lapack_int LAPACKE_zpbequ( int matrix_order, char uplo, lapack_int n, lapack_int kd, const lapack_complex_double* ab, lapack_int ldab, double* s, double* scond, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate a symmetric (Hermitian) positive definite band matrix A and reduce its condition number (with respect to the two-norm). The output array s returns scale factors computed as
s(i) = 1/sqrt(a(i,i)).
These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has diagonal elements equal to 1.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.
See the ?laqsb auxiliary function, which uses the scaling factors computed by ?pbequ.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored in the array ab:
If uplo = 'U', the array ab stores the upper triangular part of the matrix A.
If uplo = 'L', the array ab stores the lower triangular part of the matrix A.
n INTEGER. The order of matrix A; n ≥ 0.
kd INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.
ab REAL for spbequ, DOUBLE PRECISION for dpbequ, COMPLEX for cpbequ, DOUBLE COMPLEX for zpbequ.
Array, DIMENSION (ldab,*). The array ab contains either the upper or the lower triangular part of the matrix A (as specified by uplo) in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1,n).
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kd+1.

Output Parameters
s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i).
amax REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbequ interface are as follows:
ab Holds the array A of size (kd+1,n).
s Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
If amax is very close to overflow or very close to underflow, the matrix A should be scaled.

?syequb
Computes row and column scaling factors intended to equilibrate a symmetric indefinite matrix and reduce its condition number.

Syntax

Fortran 77:
call ssyequb( uplo, n, a, lda, s, scond, amax, work, info )
call dsyequb( uplo, n, a, lda, s, scond, amax, work, info )
call csyequb( uplo, n, a, lda, s, scond, amax, work, info )
call zsyequb( uplo, n, a, lda, s, scond, amax, work, info )

C:
lapack_int LAPACKE_ssyequb( int matrix_order, char uplo, lapack_int n, const float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_dsyequb( int matrix_order, char uplo, lapack_int n, const double* a, lapack_int lda, double* s, double* scond, double* amax );
lapack_int LAPACKE_csyequb( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_zsyequb( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* s, double* scond, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate a symmetric indefinite matrix A and reduce its condition number (with respect to the two-norm).
The array s contains the scale factors, s(i) = 1/sqrt(A(i,i)). These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has ones on the diagonal.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work REAL for ssyequb, DOUBLE PRECISION for dsyequb, COMPLEX for csyequb, DOUBLE COMPLEX for zsyequb.
Array a: DIMENSION (lda,*). Contains the n-by-n symmetric indefinite matrix A whose scaling factors are to be computed. Only the diagonal elements of A are referenced. The second dimension of a must be at least max(1,n).
work(*) is a workspace array. The dimension of work is at least max(1,3*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).

Output Parameters
s REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). If info = 0, the array s contains the scale factors for A.
scond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i). If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
amax REAL for single precision flavors, DOUBLE PRECISION for double precision flavors.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or underflow, the matrix should be scaled.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

?heequb
Computes row and column scaling factors intended to equilibrate a Hermitian indefinite matrix and reduce its condition number.

Syntax

Fortran 77:
call cheequb( uplo, n, a, lda, s, scond, amax, work, info )
call zheequb( uplo, n, a, lda, s, scond, amax, work, info )

C:
lapack_int LAPACKE_cheequb( int matrix_order, char uplo, lapack_int n, const lapack_complex_float* a, lapack_int lda, float* s, float* scond, float* amax );
lapack_int LAPACKE_zheequb( int matrix_order, char uplo, lapack_int n, const lapack_complex_double* a, lapack_int lda, double* s, double* scond, double* amax );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes row and column scalings intended to equilibrate a Hermitian indefinite matrix A and reduce its condition number (with respect to the two-norm).
The array s contains the scale factors, s(i) = 1/sqrt(A(i,i)). These factors are chosen so that the scaled matrix B with elements b(i,j) = s(i)*a(i,j)*s(j) has ones on the diagonal.
This choice of s puts the condition number of B within a factor n of the smallest possible condition number over all possible diagonal scalings.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A.
If uplo = 'L', the array a stores the lower triangular part of the matrix A.
n INTEGER. The order of the matrix A; n ≥ 0.
a, work COMPLEX for cheequb, DOUBLE COMPLEX for zheequb.
Array a: DIMENSION (lda,*).
Contains the n-by-n Hermitian indefinite matrix A whose scaling factors are to be computed. Only the diagonal elements of A are referenced. The second dimension of a must be at least max(1,n).
work(*) is a workspace array. The dimension of work is at least max(1,3*n).
lda INTEGER. The leading dimension of a; lda ≥ max(1,n).

Output Parameters
s REAL for cheequb, DOUBLE PRECISION for zheequb.
Array, DIMENSION (n). If info = 0, the array s contains the scale factors for A.
scond REAL for cheequb, DOUBLE PRECISION for zheequb.
If info = 0, scond contains the ratio of the smallest s(i) to the largest s(i). If scond ≥ 0.1 and amax is neither too large nor too small, it is not worth scaling by s.
amax REAL for cheequb, DOUBLE PRECISION for zheequb.
Absolute value of the largest element of the matrix A. If amax is very close to overflow or underflow, the matrix should be scaled.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the i-th diagonal element of A is nonpositive.

Driver Routines
Table "Driver Routines for Solving Systems of Linear Equations" lists the LAPACK driver routines for solving systems of linear equations with real or complex matrices.
Driver Routines for Solving Systems of Linear Equations

Matrix type, storage scheme                          Simple Driver   Expert Driver    Expert Driver using Extra-Precise Iterative Refinement
general                                              ?gesv           ?gesvx           ?gesvxx
general band                                         ?gbsv           ?gbsvx           ?gbsvxx
general tridiagonal                                  ?gtsv           ?gtsvx
diagonally dominant tridiagonal                      ?dtsvb
symmetric/Hermitian positive-definite                ?posv           ?posvx           ?posvxx
symmetric/Hermitian positive-definite, packed storage  ?ppsv         ?ppsvx
symmetric/Hermitian positive-definite, band          ?pbsv           ?pbsvx
symmetric/Hermitian positive-definite, tridiagonal   ?ptsv           ?ptsvx
symmetric/Hermitian indefinite                       ?sysv/?hesv     ?sysvx/?hesvx    ?sysvxx/?hesvxx
symmetric/Hermitian indefinite, packed storage       ?spsv/?hpsv     ?spsvx/?hpsvx
complex symmetric                                    ?sysv           ?sysvx
complex symmetric, packed storage                    ?spsv           ?spsvx

In this table ? stands for s (single precision real), d (double precision real), c (single precision complex), or z (double precision complex). In the description of the ?gesv and ?posv routines, the ? sign also stands for the combined character codes ds and zc for the mixed precision subroutines.

?gesv
Computes the solution to the system of linear equations with a square matrix A and multiple right-hand sides.
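For orientation, the computation that a simple driver such as ?gesv performs can be sketched in plain C. The sketch below is not the Intel MKL implementation and does not call LAPACKE; the function name lu_solve, the row-major layout, and the 0-based pivot indices are assumptions made for illustration only. It factors A with partial pivoting and row interchanges, overwrites B with the solution, and returns a gesv-style status: 0 on success, or i > 0 if the i-th pivot is exactly zero.

```c
#include <assert.h>

/* Absolute value helper; avoids a dependency on the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/*
 * Illustrative sketch only (hypothetical helper, not the library code):
 * solve A*X = B by LU factorization with partial pivoting.
 *   a    : n-by-n matrix, row-major; overwritten by L (below the diagonal,
 *          unit diagonal not stored) and U (on and above the diagonal)
 *   b    : n-by-nrhs right-hand sides, row-major; overwritten by X
 *   ipiv : length n; row k was interchanged with row ipiv[k] (0-based)
 * Returns 0 on success, or i > 0 (1-based) if the i-th pivot is exactly
 * zero. Unlike the library routine, this sketch stops at the first zero
 * pivot instead of completing the factorization.
 */
int lu_solve(int n, int nrhs, double *a, int *ipiv, double *b)
{
    assert(n >= 0 && nrhs >= 0);
    for (int k = 0; k < n; ++k) {
        /* Pick the entry of largest magnitude in column k as the pivot. */
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (abs_d(a[i * n + k]) > abs_d(a[p * n + k]))
                p = i;
        ipiv[k] = p;
        if (a[p * n + k] == 0.0)
            return k + 1;                     /* U is exactly singular */
        if (p != k) {                         /* interchange rows k and p */
            for (int j = 0; j < n; ++j) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
            for (int j = 0; j < nrhs; ++j) {
                double t = b[k * nrhs + j];
                b[k * nrhs + j] = b[p * nrhs + j];
                b[p * nrhs + j] = t;
            }
        }
        /* Eliminate below the pivot; keep the multipliers as L. */
        for (int i = k + 1; i < n; ++i) {
            double m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for (int j = k + 1; j < n; ++j)
                a[i * n + j] -= m * a[k * n + j];
            for (int j = 0; j < nrhs; ++j)
                b[i * nrhs + j] -= m * b[k * nrhs + j];
        }
    }
    /* Back substitution with U; the forward step was applied to b above. */
    for (int j = 0; j < nrhs; ++j)
        for (int i = n - 1; i >= 0; --i) {
            double s = b[i * nrhs + j];
            for (int c = i + 1; c < n; ++c)
                s -= a[i * n + c] * b[c * nrhs + j];
            b[i * nrhs + j] = s / a[i * n + i];
        }
    return 0;
}
```

Calling lu_solve(2, 1, a, ipiv, b) with a = {0, 2, 1, 1} and b = {4, 3} forces a row interchange in the first step and leaves the solution of the system in b. In production code the LAPACKE interface of the library should be used instead of a hand-written solver.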
Syntax
Fortran 77:
call sgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
call dgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
call cgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
call zgesv( n, nrhs, a, lda, ipiv, b, ldb, info )
call dsgesv( n, nrhs, a, lda, ipiv, b, ldb, x, ldx, work, swork, iter, info )
call zcgesv( n, nrhs, a, lda, ipiv, b, ldb, x, ldx, work, swork, rwork, iter, info )
Fortran 95:
call gesv( a, b [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>gesv( int matrix_order, lapack_int n, lapack_int nrhs, <datatype>* a, lapack_int lda, lapack_int* ipiv, <datatype>* b, lapack_int ldb );
lapack_int LAPACKE_dsgesv( int matrix_order, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, lapack_int* ipiv, double* b, lapack_int ldb, double* x, lapack_int ldx, lapack_int* iter );
lapack_int LAPACKE_zcgesv( int matrix_order, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_int* ipiv, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, lapack_int* iter );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B, where A is an n-by-n matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The LU decomposition with partial pivoting and row interchanges is used to factor A as A = P*L*U, where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. The factored form of A is then used to solve the system of equations A*X = B.
The dsgesv and zcgesv routines are mixed precision iterative refinement subroutines that exploit fast single precision hardware.
They first attempt to factorize the matrix in single precision (dsgesv) or single complex precision (zcgesv) and use this factorization within an iterative refinement procedure to produce a solution with double precision (dsgesv) or double complex precision (zcgesv) normwise backward error quality (see below). If this approach fails, the method switches to a double precision or double complex precision factorization, respectively, and computes the solution.
Iterative refinement does not pay off if the ratio of single precision performance to double precision performance is too small. A reasonable strategy should take the number of right-hand sides and the size of the matrix into account. This might be done with a call to ilaenv in the future. At present, iterative refinement is always attempted.
The iterative refinement process is stopped if iter > itermax or, for all the right-hand sides:
rnmr < sqrt(n)*xnrm*anrm*eps*bwdmax
where
• iter is the number of the current iteration in the iterative refinement process
• rnmr is the infinity-norm of the residual
• xnrm is the infinity-norm of the solution
• anrm is the infinity-operator-norm of the matrix A
• eps is the machine epsilon returned by dlamch('Epsilon').
The values itermax and bwdmax are fixed to 30 and 1.0d+00 respectively.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The number of linear equations, that is, the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, that is, the number of columns of the matrix B; nrhs ≥ 0.
a, b REAL for sgesv
DOUBLE PRECISION for dgesv and dsgesv
COMPLEX for cgesv
DOUBLE COMPLEX for zgesv and zcgesv.
Arrays: a(lda,*), b(ldb,*).
The array a contains the n-by-n coefficient matrix A.
The array b contains the n-by-nrhs right-hand side matrix B.
The second dimension of a must be at least max(1, n), the second dimension of b at least max(1,nrhs).
lda INTEGER. The leading dimension of the array a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the array x; ldx ≥ max(1, n).
work DOUBLE PRECISION for dsgesv
DOUBLE COMPLEX for zcgesv.
Workspace array, DIMENSION at least max(1,n*nrhs). This array is used to hold the residual vectors.
swork REAL for dsgesv
COMPLEX for zcgesv.
Workspace array, DIMENSION at least max(1,n*(n+nrhs)). This array is used to store the single precision matrix and the right-hand sides or solutions in single precision.
rwork DOUBLE PRECISION. Workspace array, DIMENSION at least max(1,n).

Output Parameters
a Overwritten by the factors L and U from the factorization of A = P*L*U; the unit diagonal elements of L are not stored.
If iterative refinement has been successfully used (info = 0 and iter ≥ 0), then A is unchanged.
If double precision factorization has been used (info = 0 and iter < 0), then the array A contains the factors L and U from the factorization A = P*L*U; the unit diagonal elements of L are not stored.
b Overwritten by the solution matrix X for sgesv, dgesv, cgesv, zgesv. Unchanged for dsgesv and zcgesv.
ipiv INTEGER.
Array, DIMENSION at least max(1, n). The pivot indices that define the permutation matrix P; row i of the matrix was interchanged with row ipiv(i). Corresponds to the single precision factorization (if info = 0 and iter ≥ 0) or the double precision factorization (if info = 0 and iter < 0).
x DOUBLE PRECISION for dsgesv
DOUBLE COMPLEX for zcgesv.
Array, DIMENSION (ldx, nrhs). If info = 0, contains the n-by-nrhs solution matrix X.
iter INTEGER.
If iter < 0: iterative refinement has failed, double precision factorization has been performed
• If iter = -1: the routine fell back to full precision for an implementation- or machine-specific reason
• If iter = -2: narrowing the precision induced an overflow, the routine fell back to full precision
• If iter = -3: failure of sgetrf for dsgesv, or cgetrf for zcgesv
• If iter = -31: the iterative refinement was stopped after the 30th iteration.
If iter > 0: iterative refinement has been successfully used. Returns the number of iterations.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, U(i, i) (computed in double precision for mixed precision subroutines) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gesv interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector of length n.
NOTE Fortran 95 Interface is so far not available for the mixed precision subroutines dsgesv/zcgesv.

See Also
ilaenv
?lamch
?getrf

?gesvx
Computes the solution to the system of linear equations with a square matrix A and multiple right-hand sides, and provides error bounds on the solution.
Syntax
Fortran 77:
call sgesvx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dgesvx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cgesvx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zgesvx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call gesvx( a, b, x [,af] [,ipiv] [,fact] [,trans] [,equed] [,r] [,c] [,ferr] [,berr] [,rcond] [,rpvgrw] [,info] )
C:
lapack_int LAPACKE_sgesvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* r, float* c, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr, float* rpivot );
lapack_int LAPACKE_dgesvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* r, double* c, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr, double* rpivot );
lapack_int LAPACKE_cgesvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* r, float* c, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr, float* rpivot );
lapack_int LAPACKE_zgesvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* r, double* c, lapack_complex_double* b, lapack_int ldb,
lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr, double* rpivot );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?gesvx performs the following steps:
1. If fact = 'E', real scaling factors r and c are computed to equilibrate the system:
trans = 'N': diag(r)*A*diag(c)*inv(diag(c))*X = diag(r)*B
trans = 'T': (diag(r)*A*diag(c))**T*inv(diag(r))*X = diag(c)*B
trans = 'C': (diag(r)*A*diag(c))**H*inv(diag(r))*X = diag(c)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(r)*A*diag(c) and B by diag(r)*B (if trans = 'N') or diag(c)*B (if trans = 'T' or 'C').
2. If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = P*L*U, where P is a permutation matrix, L is a unit lower triangular matrix, and U is upper triangular.
3. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n + 1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(c) (if trans = 'N') or diag(r) (if trans = 'T' or 'C') so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface, except for rpivot. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, af and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. a, af, and ipiv are not modified.
If fact = 'N', the matrix A will be copied to af and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to af and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A**T*X = B (Transpose).
If trans = 'C', the system has the form A**H*X = B (Transpose for real flavors, conjugate transpose for complex flavors).
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sgesvx
DOUBLE PRECISION for dgesvx
COMPLEX for cgesvx
DOUBLE COMPLEX for zgesvx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the matrix A. If fact = 'F' and equed is not 'N', then A must have been equilibrated by the scaling factors in r and/or c. The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'.
It contains the factored form of the matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by ?getrf. If equed is not 'N', then af is the factored form of the equilibrated matrix A. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER.
Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains the pivot indices from the factorization A = P*L*U as computed by ?getrf; row i of the matrix was interchanged with row ipiv(i).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r).
If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c).
If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c).
r, c REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A. These arrays are input arguments if fact = 'F' only; otherwise they are output arguments.
If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed.
If fact = 'F' and equed = 'R' or 'B', each element of r must be positive.
If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed.
If fact = 'F' and equed = 'C' or 'B', each element of c must be positive.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
iwork INTEGER.
Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION at least max(1, 2*n); used in complex flavors only.

Output Parameters
x REAL for sgesvx
DOUBLE PRECISION for dgesvx
COMPLEX for cgesvx
DOUBLE COMPLEX for zgesvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is:
inv(diag(c))*X, if trans = 'N' and equed = 'C' or 'B';
inv(diag(r))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'.
The second dimension of x must be at least max(1,nrhs).
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'.
If equed ≠ 'N', A is scaled on exit as follows:
equed = 'R': A = diag(r)*A
equed = 'C': A = A*diag(c)
equed = 'B': A = diag(r)*A*diag(c).
af If fact = 'N' or 'E', then af is an output argument and on exit returns the factors L and U from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b Overwritten by diag(r)*B if trans = 'N' and equed = 'R' or 'B';
overwritten by diag(c)*B if trans = 'T' or 'C' and equed = 'C' or 'B';
not changed if equed = 'N'.
r, c These arrays are output arguments if fact ≠ 'F'. See the description of r, c in Input Parameters section.
rcond REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Parameters section).
work, rwork, rpivot On exit, work(1) for real flavors, or rwork(1) for complex flavors (the Fortran interface) and rpivot (the C interface), contains the reciprocal pivot growth factor norm(A)/norm(U). The "max absolute element" norm is used. If work(1) for real flavors, or rwork(1) for complex flavors is much less than 1, then the stability of the LU factorization of the (equilibrated) matrix A could be poor.
This also means that the solution x, condition estimator rcond, and forward error bound ferr could be unreliable. If factorization fails with 0 < info ≤ n, then work(1) for real flavors, or rwork(1) for complex flavors contains the reciprocal pivot growth factor for the leading info columns of A.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, then U(i, i) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n+1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gesvx interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
af Holds the matrix AF of size (n,n).
ipiv Holds the vector of length n.
r Holds the vector of length n. Default value for each element is r(i) = 1.0_WP.
c Holds the vector of length n. Default value for each element is c(i) = 1.0_WP.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
fact Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then both arguments af and ipiv must be present; otherwise, an error is returned.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
equed Must be 'N', 'B', 'C', or 'R'. The default value is 'N'.
rpvgrw Real value that contains the reciprocal pivot growth factor norm(A)/norm(U).

?gesvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a square matrix A and multiple right-hand sides.

Syntax
Fortran 77:
call sgesvxx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dgesvxx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cgesvxx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zgesvxx( fact, trans, n, nrhs, a, lda, af, ldaf, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sgesvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* r, float* c, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_dgesvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* r, double* c, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
lapack_int LAPACKE_cgesvxx( int matrix_order, char fact, char trans, lapack_int n,
lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* r, float* c, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zgesvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* r, double* c, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n matrix, the columns of the matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned.
The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options.
Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.
The routine ?gesvxx performs the following steps:
1. If fact = 'E', scaling factors r and c are computed to equilibrate the system:
trans = 'N': diag(r)*A*diag(c)*inv(diag(c))*X = diag(r)*B
trans = 'T': (diag(r)*A*diag(c))**T*inv(diag(r))*X = diag(c)*B
trans = 'C': (diag(r)*A*diag(c))**H*inv(diag(r))*X = diag(c)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(r)*A*diag(c) and B by diag(r)*B (if trans = 'N') or diag(c)*B (if trans = 'T' or 'C').
2. If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = P*L*U, where P is a permutation matrix, L is a unit lower triangular matrix, and U is upper triangular.
3. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to improve the computed solution matrix and calculate error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(c) (if trans = 'N') or diag(r) (if trans = 'T' or 'C') so that it solves the original system before equilibration.
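Steps 1 and 6 can be illustrated without any library calls. The sketch below assumes trans = 'N' and a 2-by-2 system, so that Cramer's rule can stand in for the LU solve of steps 2 to 4; the function names solve2 and solve2_equilibrated and the caller-supplied scale factors r and c are hypothetical and are not part of Intel MKL. It solves (diag(r)*A*diag(c))*Y = diag(r)*b and then recovers the solution of the original system as X = diag(c)*Y.

```c
#include <assert.h>

/*
 * Solve the 2-by-2 system a*x = b by Cramer's rule (a is row-major).
 * Hypothetical stand-in for the LU solve; returns 0 on success and
 * 1 if the determinant is exactly zero.
 */
int solve2(const double a[4], const double b[2], double x[2])
{
    double det = a[0] * a[3] - a[1] * a[2];
    if (det == 0.0)
        return 1;
    x[0] = (b[0] * a[3] - a[1] * b[1]) / det;
    x[1] = (a[0] * b[1] - b[0] * a[2]) / det;
    return 0;
}

/*
 * Illustrative sketch for trans = 'N' (not the library code):
 *   step 1: form As = diag(r)*A*diag(c) and bs = diag(r)*b
 *   solve : As*y = bs
 *   step 6: x = diag(c)*y, which solves the original system A*x = b
 * The scale factors r and c must be positive.
 */
int solve2_equilibrated(const double a[4], const double b[2],
                        const double r[2], const double c[2], double x[2])
{
    double as[4], bs[2], y[2];
    assert(r[0] > 0.0 && r[1] > 0.0 && c[0] > 0.0 && c[1] > 0.0);
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j)
            as[i * 2 + j] = r[i] * a[i * 2 + j] * c[j];
        bs[i] = r[i] * b[i];
    }
    if (solve2(as, bs, y) != 0)
        return 1;
    x[0] = c[0] * y[0];   /* undo the column scaling */
    x[1] = c[1] * y[1];
    return 0;
}
```

For example, with A = {1024, 3072, 0.5, 0.25}, b = {7168, 1}, r = {1/1024, 2} and c = {1, 2}, the scaled matrix is {1, 6, 1, 1} and the recovered x agrees with the solution of the unscaled system. The factors in this example are powers of two, so the scaling itself introduces no rounding error.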
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F', on entry, af and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. Parameters a, af, and ipiv are not modified.
If fact = 'N', the matrix A will be copied to af and factored.
If fact = 'E', the matrix A will be equilibrated, if necessary, then copied to af and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'.
Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A**T*X = B (Transpose).
If trans = 'C', the system has the form A**H*X = B (Transpose for real flavors, conjugate transpose for complex flavors).
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sgesvxx
DOUBLE PRECISION for dgesvxx
COMPLEX for cgesvxx
DOUBLE COMPLEX for zgesvxx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the matrix A. If fact = 'F' and equed is not 'N', then A must have been equilibrated by the scaling factors in r and/or c. The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the factored form of the matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by ?getrf.
If equed is not 'N', then af is the factored form of the equilibrated matrix A. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ipiv INTEGER.
Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains the pivot indices from the factorization A = P*L*U as computed by ?getrf; row i of the matrix was interchanged with row ipiv(i).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r).
If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c).
If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c).
r, c REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A. These arrays are input arguments if fact = 'F' only; otherwise they are output arguments.
If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed.
If fact = 'F' and equed = 'R' or 'B', each element of r must be positive.
If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed.
If fact = 'F' and equed = 'C' or 'B', each element of c must be positive.
Each element of r or c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in Output Parameters section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1): Whether to perform iterative refinement or not. Default: 1.0
= 0.0: No refinement is performed and no error bounds are computed.
= 1.0: Use the double-precision refinement algorithm, possibly with doubled-single computations if the compilation environment does not support DOUBLE PRECISION.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2): Maximum number of residual computations allowed for refinement.
Default: 10. Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3): Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters
x REAL for sgesvxx, DOUBLE PRECISION for dgesvxx, COMPLEX for cgesvxx, DOUBLE COMPLEX for zgesvxx. Array, DIMENSION (ldx,*). If info = 0, the array x contains the solution n-by-nrhs matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is:
inv(diag(c))*X, if trans = 'N' and equed = 'C' or 'B'; or
inv(diag(r))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'.
The second dimension of x must be at least max(1,nrhs).
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'. If equed ≠ 'N', A is scaled on exit as follows:
equed = 'R': A = diag(r)*A
equed = 'C': A = A*diag(c)
equed = 'B': A = diag(r)*A*diag(c).
af If fact = 'N' or 'E', then af is an output argument and on exit returns the factors L and U from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b Overwritten by diag(r)*B if trans = 'N' and equed = 'R' or 'B'; overwritten by diag(c)*B if trans = 'T' or 'C' and equed = 'C' or 'B'; not changed if equed = 'N'.
r, c These arrays are output arguments if fact ≠ 'F'. Each element of these arrays is a power of the radix. See the description of r, c in Input Arguments section.
rcond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
rpvgrw REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A. In ?gesvx, this quantity is returned in work(1).
berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j abs(xtrue(j,i) - x(j,i)) / max_j abs(x(j,i)).
The array is indexed by the type of error information as described below.
There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)) ).
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side.
If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).
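The three fields of err_bnds_norm and err_bnds_comp described above can be read with a few small helpers. The following is a minimal sketch, not part of the library: the function and enum names are hypothetical, the array is assumed to be stored column-major with dimensions (nrhs, n_err_bnds) as in the Fortran interface, and double precision is assumed.

```c
#include <assert.h>

/* 0-based counterparts of the documented field indices err = 1, 2, 3.
   These names are illustrative only. */
enum { ERR_TRUST = 0, ERR_BOUND = 1, ERR_RCOND = 2 };

/* Element (i, err) of an nrhs-by-n_err_bnds column-major array (0-based). */
static double err_bnd_at(const double *err_bnds, int nrhs, int i, int err)
{
    return err_bnds[i + err * nrhs];
}

/* Returns the 0-based index of the first right-hand side whose
   "trust/don't trust" flag is 0.0, or -1 if every flag is nonzero
   (or if no flags were requested, n_err_bnds < 1). */
int first_untrusted_rhs(const double *err_bnds, int nrhs, int n_err_bnds)
{
    assert(nrhs >= 0);
    if (n_err_bnds < 1)
        return -1;
    for (int i = 0; i < nrhs; ++i)
        if (err_bnd_at(err_bnds, nrhs, i, ERR_TRUST) == 0.0)
            return i;
    return -1;
}

/* Stores the "guaranteed" bound for right-hand side i in *bound and
   returns 1 only when the trust flag is set; returns 0 otherwise. */
int guaranteed_bound(const double *err_bnds, int nrhs, int n_err_bnds,
                     int i, double *bound)
{
    assert(i >= 0 && i < nrhs);
    if (n_err_bnds < 2)
        return 0;
    if (err_bnd_at(err_bnds, nrhs, i, ERR_TRUST) == 0.0)
        return 0;
    *bound = err_bnd_at(err_bnds, nrhs, i, ERR_BOUND);
    return 1;
}
```

The sketch only reads the flag that the routine has already computed; it does not recompute the sqrt(n)*eps threshold.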
params If an entry is less than 0.0, that entry is filled with the default value used for that parameter; otherwise the entry is not modified.
info INTEGER.
If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), the j-th right-hand side is the first with either a normwise or componentwise error bound that is not guaranteed (the smallest j such that either err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?gbsv
Computes the solution to the system of linear equations with a band matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sgbsv( n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call dgbsv( n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call cgbsv( n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
call zgbsv( n, kl, ku, nrhs, ab, ldab, ipiv, b, ldb, info )
Fortran 95:
call gbsv( ab, b [,kl] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>gbsv( int matrix_order, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, <datatype>* ab, lapack_int ldab, lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n band matrix with kl subdiagonals and ku superdiagonals, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The LU decomposition with partial pivoting and row interchanges is used to factor A as A = L*U, where L is a product of permutation and unit lower triangular matrices with kl subdiagonals, and U is upper triangular with kl+ku superdiagonals. The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of A. The number of rows in B; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides. The number of columns in B; nrhs ≥ 0.
ab, b REAL for sgbsv, DOUBLE PRECISION for dgbsv, COMPLEX for cgbsv, DOUBLE COMPLEX for zgbsv. Arrays: ab(ldab,*), b(ldb,*). The array ab contains the matrix A in band storage (see Matrix Storage Schemes).
The second dimension of ab must be at least max(1, n). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
ldab INTEGER. The leading dimension of the array ab; ldab ≥ 2*kl + ku + 1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
ab Overwritten by L and U. The diagonal and kl + ku superdiagonals of U are stored in the first 1 + kl + ku rows of ab. The multipliers used to form L are stored in the next kl rows.
b Overwritten by the solution matrix X.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The pivot indices: row i was interchanged with row ipiv(i).
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, U(i, i) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbsv interface are as follows:
ab Holds the array A of size (2*kl+ku+1,n).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector of length n.
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-2*kl-1.

?gbsvx
Computes the solution to the real or complex system of linear equations with a band matrix A and multiple right-hand sides, and provides error bounds on the solution.
Syntax
Fortran 77:
call sgbsvx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dgbsvx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cgbsvx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zgbsvx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call gbsvx( ab, b, x [,kl] [,afb] [,ipiv] [,fact] [,trans] [,equed] [,r] [,c] [,ferr] [,berr] [,rcond] [,rpvgrw] [,info] )
C:
lapack_int LAPACKE_sgbsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, float* ab, lapack_int ldab, float* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, float* r, float* c, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr, float* rpivot );
lapack_int LAPACKE_dgbsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, double* ab, lapack_int ldab, double* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, double* r, double* c, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr, double* rpivot );
lapack_int LAPACKE_cgbsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, float* r, float* c, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr, float* rpivot );
lapack_int LAPACKE_zgbsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int
nrhs, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, double* r, double* c, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr, double* rpivot );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, A^T*X = B, or A^H*X = B, where A is a band matrix of order n with kl subdiagonals and ku superdiagonals, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?gbsvx performs the following steps:
1. If fact = 'E', real scaling factors r and c are computed to equilibrate the system:
trans = 'N': diag(r)*A*diag(c)*inv(diag(c))*X = diag(r)*B
trans = 'T': (diag(r)*A*diag(c))^T*inv(diag(r))*X = diag(c)*B
trans = 'C': (diag(r)*A*diag(c))^H*inv(diag(r))*X = diag(c)*B
Whether the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(r)*A*diag(c) and B by diag(r)*B (if trans = 'N') or diag(c)*B (if trans = 'T' or 'C').
2. If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = L*U, where L is a product of permutation and unit lower triangular matrices with kl subdiagonals, and U is upper triangular with kl+ku superdiagonals.
3. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A.
If the reciprocal of the condition number is less than machine precision, info = n + 1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(c) (if trans = 'N') or diag(r) (if trans = 'T' or 'C') so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface, except for rpivot. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, afb and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. ab, afb, and ipiv are not modified.
If fact = 'N', the matrix A will be copied to afb and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to afb and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations:
If trans = 'N', the system has the form A*X = B (No transpose).
If trans = 'T', the system has the form A^T*X = B (Transpose).
If trans = 'C', the system has the form A^H*X = B (Transpose for real flavors, conjugate transpose for complex flavors).
n INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER.
The number of superdiagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns of the matrices B and X; nrhs ≥ 0.
ab, afb, b, work REAL for sgbsvx, DOUBLE PRECISION for dgbsvx, COMPLEX for cgbsvx, DOUBLE COMPLEX for zgbsvx. Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), work(*).
The array ab contains the matrix A in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n). If fact = 'F' and equed is not 'N', then A must have been equilibrated by the scaling factors in r and/or c.
The array afb is an input argument if fact = 'F'. The second dimension of afb must be at least max(1,n). It contains the factored form of the matrix A, that is, the factors L and U from the factorization A = L*U as computed by ?gbtrf. U is stored as an upper triangular band matrix with kl + ku superdiagonals in the first 1 + kl + ku rows of afb. The multipliers used during the factorization are stored in the next kl rows. If equed is not 'N', then afb is the factored form of the equilibrated matrix A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,3*n) for real flavors, and at least max(1,2*n) for complex flavors.
ldab INTEGER. The leading dimension of ab; ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of afb; ldafb ≥ 2*kl+ku+1.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains the pivot indices from the factorization A = L*U as computed by ?gbtrf; row i of the matrix was interchanged with row ipiv(i).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. equed is an input argument if fact = 'F'.
It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r).
If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c).
If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c).
r, c REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A. These arrays are input arguments if fact = 'F' only; otherwise they are output arguments.
If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If fact = 'F' and equed = 'R' or 'B', each element of r must be positive.
If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If fact = 'F' and equed = 'C' or 'B', each element of c must be positive.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters
x REAL for sgbsvx, DOUBLE PRECISION for dgbsvx, COMPLEX for cgbsvx, DOUBLE COMPLEX for zgbsvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is:
inv(diag(c))*X, if trans = 'N' and equed = 'C' or 'B';
inv(diag(r))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'.
The second dimension of x must be at least max(1,nrhs).
ab Array ab is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'. If equed ≠ 'N', A is scaled on exit as follows:
equed = 'R': A = diag(r)*A
equed = 'C': A = A*diag(c)
equed = 'B': A = diag(r)*A*diag(c).
afb If fact = 'N' or 'E', then afb is an output argument and on exit returns details of the LU factorization of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E'). See the description of ab for the form of the equilibrated matrix.
b Overwritten by diag(r)*b if trans = 'N' and equed = 'R' or 'B'; overwritten by diag(c)*b if trans = 'T' or 'C' and equed = 'C' or 'B'; not changed if equed = 'N'.
r, c These arrays are output arguments if fact ≠ 'F'. See the description of r, c in Input Arguments section.
rcond REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors, DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs).
Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).
work, rwork, rpivot On exit, work(1) for real flavors, or rwork(1) for complex flavors (the Fortran interface) and rpivot (the C interface), contains the reciprocal pivot growth factor norm(A)/norm(U). The "max absolute element" norm is used. If work(1) for real flavors, or rwork(1) for complex flavors is much less than 1, then the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution x, condition estimator rcond, and forward error bound ferr could be unreliable. If factorization fails with 0 < info ≤ n, then work(1) for real flavors, or rwork(1) for complex flavors contains the reciprocal pivot growth factor for the leading info columns of A.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, then U(i, i) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n+1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.
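The four ranges of info described above lend themselves to a small classification helper in calling code. The following is a minimal sketch under stated assumptions: the enum and function names are hypothetical (they are not MKL or LAPACKE identifiers), and info is taken as a plain int rather than lapack_int.

```c
#include <assert.h>

/* Hypothetical status categories for the info value returned by ?gbsvx. */
typedef enum {
    GBSVX_OK,              /* info = 0: successful execution            */
    GBSVX_BAD_ARGUMENT,    /* info = -i: i-th parameter was illegal     */
    GBSVX_SINGULAR,        /* 0 < info <= n: U(info,info) is exactly 0  */
    GBSVX_ILL_CONDITIONED  /* info = n+1: rcond below machine precision */
} gbsvx_status;

/* Maps info (for a system of order n) to one of the categories above. */
gbsvx_status classify_gbsvx_info(int info, int n)
{
    assert(n >= 0 && info <= n + 1);
    if (info == 0)
        return GBSVX_OK;
    if (info < 0)
        return GBSVX_BAD_ARGUMENT;
    if (info <= n)
        return GBSVX_SINGULAR;
    return GBSVX_ILL_CONDITIONED;
}

/* Nonzero when x holds a computed solution: info = 0 or info = n+1. */
int gbsvx_solution_available(int info, int n)
{
    gbsvx_status s = classify_gbsvx_info(info, n);
    return s == GBSVX_OK || s == GBSVX_ILL_CONDITIONED;
}
```

Treating info = n+1 as a warning rather than a failure mirrors the description of x, which is returned for both info = 0 and info = n+1; a caller would typically inspect rcond and ferr before accepting such a solution.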

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gbsvx interface are as follows:
ab Holds the array A of size (kl+ku+1,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
afb Holds the array AF of size (2*kl+ku+1,n).
ipiv Holds the vector of length n.
r Holds the vector of length n. Default value for each element is r(i) = 1.0_WP.
c Holds the vector of length n. Default value for each element is c(i) = 1.0_WP.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
equed Must be 'N', 'B', 'C', or 'R'. The default value is 'N'.
fact Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then both arguments afb and ipiv must be present; otherwise, an error is returned.
rpvgrw Real value that contains the reciprocal pivot growth factor norm(A)/norm(U).
kl If omitted, assumed kl = ku.
ku Restored as ku = lda-kl-1.
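To illustrate the (kl+ku+1, n) band array that ab holds, the sketch below packs a dense column-major matrix into that layout using the LAPACK band storage convention ab(ku+1+i-j, j) = A(i, j) for max(1, j-ku) ≤ i ≤ min(n, j+kl). The function name pack_band and its argument list are hypothetical and are not part of MKL; indices in the code are 0-based, so the row offset becomes ku + i - j.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the band of a dense n-by-n column-major matrix a (leading
   dimension lda) into band storage ab (leading dimension ldab), using
   0-based indices: ab[(ku + i - j) + j*ldab] = a[i + j*lda].
   Entries of ab outside the band are left untouched. */
void pack_band(int n, int kl, int ku,
               const double *a, int lda,
               double *ab, int ldab)
{
    assert(n >= 0 && kl >= 0 && ku >= 0);
    assert(lda >= (n > 1 ? n : 1));
    assert(ldab >= kl + ku + 1);

    for (int j = 0; j < n; ++j) {
        /* Rows of column j that lie inside the band. */
        int i_lo = (j - ku > 0) ? j - ku : 0;
        int i_hi = (j + kl < n - 1) ? j + kl : n - 1;
        for (int i = i_lo; i <= i_hi; ++i)
            ab[(size_t)(ku + i - j) + (size_t)j * (size_t)ldab] =
                a[(size_t)i + (size_t)j * (size_t)lda];
    }
}
```

This layout matches the ldab ≥ kl+ku+1 requirement of ?gbsvx. The ?gbsv routine instead requires ldab ≥ 2*kl+ku+1, so an array prepared for it needs kl additional leading rows, which this sketch does not allocate.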

?gbsvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a banded matrix A and multiple right-hand sides.

Syntax
Fortran 77:
call sgbsvxx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dgbsvxx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cgbsvxx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zgbsvxx( fact, trans, n, kl, ku, nrhs, ab, ldab, afb, ldafb, ipiv, equed, r, c, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sgbsvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, float* ab, lapack_int ldab, float* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, float* r, float* c, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_dgbsvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, double* ab, lapack_int ldab, double* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, double* r, double* c, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
lapack_int LAPACKE_cgbsvxx( int matrix_order, char fact, char trans,
lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, float* r, float* c, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zgbsvxx( int matrix_order, char fact, char trans, lapack_int n, lapack_int kl, lapack_int ku, lapack_int nrhs, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* afb, lapack_int ldafb, lapack_int* ipiv, char* equed, double* r, double* c, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n banded matrix, the columns of the matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned.
The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options.
Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.
The routine ?gbsvxx performs the following steps:
1. If fact = 'E', scaling factors r and c are computed to equilibrate the system:
trans = 'N': diag(r)*A*diag(c)*inv(diag(c))*X = diag(r)*B
trans = 'T': (diag(r)*A*diag(c))^T*inv(diag(r))*X = diag(c)*B
trans = 'C': (diag(r)*A*diag(c))^H*inv(diag(r))*X = diag(c)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(r)*A*diag(c) and B by diag(r)*B (if trans = 'N') or diag(c)*B (if trans = 'T' or 'C').
2. If fact = 'N' or 'E', the LU decomposition is used to factor the matrix A (after equilibration if fact = 'E') as A = P*L*U, where P is a permutation matrix, L is a unit lower triangular matrix, and U is upper triangular.
3. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to improve the computed solution matrix and calculate error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(c) (if trans = 'N') or diag(r) (if trans = 'T' or 'C') so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored. If fact = 'F', on entry, afb and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by r and c. Parameters ab, afb, and ipiv are not modified. If fact = 'N', the matrix A will be copied to afb and factored. If fact = 'E', the matrix A will be equilibrated, if necessary, copied to afb and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form A*X = B (No transpose). If trans = 'T', the system has the form A**T*X = B (Transpose). If trans = 'C', the system has the form A**H*X = B (Conjugate Transpose = Transpose for real flavors, Conjugate Transpose for complex flavors).
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
kl INTEGER. The number of subdiagonals within the band of A; kl ≥ 0.
ku INTEGER. The number of superdiagonals within the band of A; ku ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
ab, afb, b, work REAL for sgbsvxx DOUBLE PRECISION for dgbsvxx COMPLEX for cgbsvxx DOUBLE COMPLEX for zgbsvxx. Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), work(*). The array ab contains the matrix A in band storage, in rows 1 to kl+ku+1. The j-th column of A is stored in the j-th column of the array ab as follows: ab(ku+1+i-j,j) = A(i,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl). If fact = 'F' and equed is not 'N', then ab must have been equilibrated by the scaling factors in r and/or c.
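The band-storage rule for ab, ab(ku+1+i-j,j) = A(i,j) for rows i inside the band of column j, can be illustrated with a short C sketch. This is only an illustration under stated assumptions: band_offset and pack_band are hypothetical helper names (not MKL or LAPACKE routines), the arrays are assumed to be column-major as in the Fortran interface, and i and j are 1-based to match the formula.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an MKL routine.
   Returns the 0-based column-major offset in ab of A(i,j) (i, j are 1-based),
   using ab(ku+1+i-j, j) = A(i,j); returns -1 if A(i,j) lies outside the band. */
static long band_offset(int i, int j, int n, int kl, int ku, int ldab)
{
    assert(ldab >= kl + ku + 1);
    int first = (j - ku > 1) ? j - ku : 1;   /* max(1, j-ku) */
    int last  = (j + kl < n) ? j + kl : n;   /* min(n, j+kl) */
    if (i < first || i > last)
        return -1;
    int row = ku + 1 + i - j;                /* 1-based row of ab */
    return (long)(j - 1) * ldab + (row - 1);
}

/* Hypothetical helper: copy the band of a dense column-major n-by-n matrix a
   (leading dimension lda) into ab. Entries of ab outside the band are untouched. */
static void pack_band(int n, int kl, int ku,
                      const double *a, int lda, double *ab, int ldab)
{
    for (int j = 1; j <= n; ++j) {
        for (int i = 1; i <= n; ++i) {
            long off = band_offset(i, j, n, kl, ku, ldab);
            if (off >= 0)
                ab[off] = a[(size_t)(j - 1) * lda + (size_t)(i - 1)];
        }
    }
}
```

For a 4-by-4 tridiagonal matrix (kl = ku = 1, ldab = 3), row 1 of ab then holds the superdiagonal, row 2 the diagonal, and row 3 the subdiagonal.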
The second dimension of ab must be at least max(1,n). The array afb is an input argument if fact = 'F'. It contains the factored form of the banded matrix A, that is, the factors L and U from the factorization A = P*L*U as computed by ?gbtrf. U is stored as an upper triangular banded matrix with kl + ku superdiagonals in rows 1 to kl + ku + 1. The multipliers used during the factorization are stored in rows kl + ku + 2 to 2*kl + ku + 1. If equed is not 'N', then afb is the factored form of the equilibrated matrix A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs). work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
ldab INTEGER. The leading dimension of the array ab; ldab ≥ kl+ku+1.
ldafb INTEGER. The leading dimension of the array afb; ldafb ≥ 2*kl+ku+1.
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains the pivot indices from the factorization A = P*L*U as computed by ?gbtrf; row i of the matrix was interchanged with row ipiv(i).
equed CHARACTER*1. Must be 'N', 'R', 'C', or 'B'. equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done: If equed = 'N', no equilibration was done (always true if fact = 'N'). If equed = 'R', row equilibration was done, that is, A has been premultiplied by diag(r). If equed = 'C', column equilibration was done, that is, A has been postmultiplied by diag(c). If equed = 'B', both row and column equilibration was done, that is, A has been replaced by diag(r)*A*diag(c).
r, c REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays: r(n), c(n). The array r contains the row scale factors for A, and the array c contains the column scale factors for A.
These arrays are input arguments if fact = 'F' only; otherwise they are output arguments. If equed = 'R' or 'B', A is multiplied on the left by diag(r); if equed = 'N' or 'C', r is not accessed. If fact = 'F' and equed = 'R' or 'B', each element of r must be positive. If equed = 'C' or 'B', A is multiplied on the right by diag(c); if equed = 'N' or 'R', c is not accessed. If fact = 'F' and equed = 'C' or 'B', each element of c must be positive. Each element of r or c should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1,n).
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1,n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in Output Arguments section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.
params REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the extra-precise refinement algorithm. (Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement. Default: 10. Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
Output Parameters
x REAL for sgbsvxx DOUBLE PRECISION for dgbsvxx COMPLEX for cgbsvxx DOUBLE COMPLEX for zgbsvxx. Array, DIMENSION (ldx,*). If info = 0, the array x contains the solution n-by-nrhs matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is: inv(diag(c))*X, if trans = 'N' and equed = 'C' or 'B'; or inv(diag(r))*X, if trans = 'T' or 'C' and equed = 'R' or 'B'. The second dimension of x must be at least max(1,nrhs).
ab Array ab is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'. If equed ≠ 'N', A is scaled on exit as follows: equed = 'R': A = diag(r)*A; equed = 'C': A = A*diag(c); equed = 'B': A = diag(r)*A*diag(c).
afb If fact = 'N' or 'E', then afb is an output argument and on exit returns the factors L and U from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
b Overwritten by diag(r)*B if trans = 'N' and equed = 'R' or 'B'; overwritten by diag(c)*B if trans = 'T' or 'C' and equed = 'C' or 'B'; not changed if equed = 'N'.
r, c These arrays are output arguments if fact ≠ 'F'. Each element of these arrays is a power of the radix. See the description of r, c in Input Arguments section.
rcond REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
rpvgrw REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A. In ?gbsvx, this quantity is returned in work(1).
berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / max_j |x(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( |xtrue(j,i) - x(j,i)| / |x(j,i)| )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(1/z,inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
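The three fields of err_bnds_norm and err_bnds_comp can be read back as in the following C sketch. It rests on several assumptions: the array uses the Fortran column-major layout (nrhs, n_err_bnds), the type err_info and the helpers read_err_bnds and rcond_above_threshold are hypothetical names introduced only for illustration, and eps is supplied by the caller (for example, a value obtained from slamch or dlamch). Following the err=2 and err=3 descriptions, a bound is treated as guaranteed when the reciprocal condition number exceeds sqrt(n)*eps.

```c
#include <assert.h>

/* Hypothetical container for the three fields of one right-hand side. */
typedef struct {
    int    trusted;  /* err=1: nonzero means "trust" */
    double bound;    /* err=2: estimated forward error bound */
    double rcond;    /* err=3: reciprocal condition number estimate */
} err_info;

/* Hypothetical helper: read the fields for right-hand side i (1-based) from a
   column-major array with leading dimension nrhs, as in the Fortran interface. */
static err_info read_err_bnds(const double *err_bnds, int nrhs,
                              int n_err_bnds, int i)
{
    assert(i >= 1 && i <= nrhs && n_err_bnds >= 3);
    err_info e;
    e.trusted = err_bnds[(i - 1) + 0 * nrhs] != 0.0;
    e.bound   = err_bnds[(i - 1) + 1 * nrhs];
    e.rcond   = err_bnds[(i - 1) + 2 * nrhs];
    return e;
}

/* Hypothetical helper: test rcond > sqrt(n)*eps. The comparison is done on
   squares (valid for rcond >= 0) so that no call to sqrt is needed. */
static int rcond_above_threshold(double rcond, int n, double eps)
{
    assert(n >= 0 && eps > 0.0);
    return rcond > 0.0 && rcond * rcond > (double)n * eps * eps;
}
```

A caller would typically check the trusted field before using bound, and fall back to inspecting rcond when a bound is reported as not guaranteed.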
ipiv If fact = 'N' or 'E', then ipiv is an output argument and on exit contains the pivot indices from the factorization A = P*L*U of the original matrix A (if fact = 'N') or of the equilibrated matrix A (if fact = 'E').
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).
params If an entry is less than 0.0, that entry is filled with the default value used for that parameter, otherwise the entry is not modified.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed. If info = -i, the i-th parameter had an illegal value. If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), the j-th right-hand side is the first with either a normwise or componentwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.
?gtsv
Computes the solution to the system of linear equations with a tridiagonal matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sgtsv( n, nrhs, dl, d, du, b, ldb, info )
call dgtsv( n, nrhs, dl, d, du, b, ldb, info )
call cgtsv( n, nrhs, dl, d, du, b, ldb, info )
call zgtsv( n, nrhs, dl, d, du, b, ldb, info )
Fortran 95:
call gtsv( dl, d, du, b [,info] )
C:
lapack_int LAPACKE_<?>gtsv( int matrix_order, lapack_int n, lapack_int nrhs, <datatype>* dl, <datatype>* d, <datatype>* du, <datatype>* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the system of linear equations A*X = B, where A is an n-by-n tridiagonal matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The routine uses Gaussian elimination with partial pivoting. Note that the equation A**T*X = B may be solved by interchanging the order of the arguments du and dl.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n INTEGER. The order of A, the number of rows in B; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
dl, d, du, b REAL for sgtsv DOUBLE PRECISION for dgtsv COMPLEX for cgtsv DOUBLE COMPLEX for zgtsv. Arrays: dl(n - 1), d(n), du(n - 1), b(ldb,*). The array dl contains the (n - 1) subdiagonal elements of A. The array d contains the diagonal elements of A. The array du contains the (n - 1) superdiagonal elements of A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
dl Overwritten by the (n-2) elements of the second superdiagonal of the upper triangular matrix U from the LU factorization of A.
These elements are stored in dl(1), ..., dl(n-2).
d Overwritten by the n diagonal elements of U.
du Overwritten by the (n-1) elements of the first superdiagonal of U.
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, U(i, i) is exactly zero, and the solution has not been computed. The factorization has not been completed unless i = n.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine gtsv interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
b Holds the matrix B of size (n,nrhs).
?gtsvx
Computes the solution to the real or complex system of linear equations with a tridiagonal matrix A and multiple right-hand sides, and provides error bounds on the solution.
Syntax Fortran 77: call sgtsvx( fact, trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info ) call dgtsvx( fact, trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info ) call cgtsvx( fact, trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info ) call zgtsvx( fact, trans, n, nrhs, dl, d, du, dlf, df, duf, du2, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info ) Fortran 95: call gtsvx( dl, d, du, b, x [,dlf] [,df] [,duf] [,du2] [,ipiv] [,fact] [,trans] [,ferr] [,berr] [,rcond] [,info] ) C: lapack_int LAPACKE_sgtsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, const float* dl, const float* d, const float* du, float* dlf, float* df, float* duf, float* du2, lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_dgtsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, const double* dl, const double* d, const double* du, double* dlf, double* df, double* duf, double* du2, lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr ); lapack_int LAPACKE_cgtsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_float* dl, const lapack_complex_float* d, const lapack_complex_float* du, lapack_complex_float* dlf, lapack_complex_float* df, lapack_complex_float* duf, lapack_complex_float* du2, lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_zgtsvx( int matrix_order, char fact, char trans, lapack_int n, lapack_int nrhs, const lapack_complex_double* dl, const lapack_complex_double* d, const lapack_complex_double* du, lapack_complex_double* dlf, lapack_complex_double* df, 
lapack_complex_double* duf, lapack_complex_double* du2, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine uses the LU factorization to compute the solution to a real or complex system of linear equations A*X = B, A**T*X = B, or A**H*X = B, where A is a tridiagonal matrix of order n, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Error bounds on the solution and a condition estimate are also provided.
The routine ?gtsvx performs the following steps:
1. If fact = 'N', the LU decomposition is used to factor the matrix A as A = L*U, where L is a product of permutation and unit lower bidiagonal matrices and U is an upper triangular matrix with nonzeros in only the main diagonal and first two superdiagonals.
2. If some U(i,i) = 0, so that U is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n + 1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F' or 'N'.
Specifies whether or not the factored form of the matrix A has been supplied on entry. If fact = 'F': on entry, dlf, df, duf, du2, and ipiv contain the factored form of A; arrays dl, d, du, dlf, df, duf, du2, and ipiv will not be modified. If fact = 'N', the matrix A will be copied to dlf, df, and duf and factored.
trans CHARACTER*1. Must be 'N', 'T', or 'C'. Specifies the form of the system of equations: If trans = 'N', the system has the form A*X = B (No transpose). If trans = 'T', the system has the form A**T*X = B (Transpose). If trans = 'C', the system has the form A**H*X = B (Conjugate transpose).
n INTEGER. The number of linear equations, the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns of the matrices B and X; nrhs ≥ 0.
dl, d, du, dlf, df, duf, du2, b, x, work REAL for sgtsvx DOUBLE PRECISION for dgtsvx COMPLEX for cgtsvx DOUBLE COMPLEX for zgtsvx. Arrays:
dl, DIMENSION (n-1), contains the subdiagonal elements of A.
d, DIMENSION (n), contains the diagonal elements of A.
du, DIMENSION (n-1), contains the superdiagonal elements of A.
dlf, DIMENSION (n-1). If fact = 'F', then dlf is an input argument and on entry contains the (n-1) multipliers that define the matrix L from the LU factorization of A as computed by ?gttrf.
df, DIMENSION (n). If fact = 'F', then df is an input argument and on entry contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
duf, DIMENSION (n-1). If fact = 'F', then duf is an input argument and on entry contains the (n-1) elements of the first superdiagonal of U.
du2, DIMENSION (n-2). If fact = 'F', then du2 is an input argument and on entry contains the (n-2) elements of the second superdiagonal of U.
b(ldb,*) contains the right-hand side matrix B. The second dimension of b must be at least max(1, nrhs).
x(ldx,*) contains the solution matrix X.
The second dimension of x must be at least max(1, nrhs).
work(*) is a workspace array. DIMENSION of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). If fact = 'F', then ipiv is an input argument and on entry contains the pivot indices, as returned by ?gttrf.
iwork INTEGER. Workspace array, DIMENSION (n). Used for real flavors only.
rwork REAL for cgtsvx DOUBLE PRECISION for zgtsvx. Workspace array, DIMENSION (n). Used for complex flavors only.
Output Parameters
x REAL for sgtsvx DOUBLE PRECISION for dgtsvx COMPLEX for cgtsvx DOUBLE COMPLEX for zgtsvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X. The second dimension of x must be at least max(1, nrhs).
dlf If fact = 'N', then dlf is an output argument and on exit contains the (n-1) multipliers that define the matrix L from the LU factorization of A.
df If fact = 'N', then df is an output argument and on exit contains the n diagonal elements of the upper triangular matrix U from the LU factorization of A.
duf If fact = 'N', then duf is an output argument and on exit contains the (n-1) elements of the first superdiagonal of U.
du2 If fact = 'N', then du2 is an output argument and on exit contains the (n-2) elements of the second superdiagonal of U.
ipiv The array ipiv is an output argument if fact = 'N' and, on exit, contains the pivot indices from the factorization A = L*U; row i of the matrix was interchanged with row ipiv(i). The value of ipiv(i) will always be i or i+1; ipiv(i) = i indicates a row interchange was not required.
rcond REAL for single precision flavors DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A.
If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and i ≤ n, then U(i, i) is exactly zero. The factorization has not been completed unless i = n, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gtsvx interface are as follows:
dl Holds the vector of length (n-1).
d Holds the vector of length n.
du Holds the vector of length (n-1).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
dlf Holds the vector of length (n-1).
df Holds the vector of length n.
duf Holds the vector of length (n-1).
du2 Holds the vector of length (n-2).
ipiv Holds the vector of length n.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then the arguments dlf, df, duf, du2, and ipiv must be present; otherwise, an error is returned.
trans Must be 'N', 'C', or 'T'. The default value is 'N'.
?dtsvb
Computes the solution to the system of linear equations with a diagonally dominant tridiagonal matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sdtsvb( n, nrhs, dl, d, du, b, ldb, info )
call ddtsvb( n, nrhs, dl, d, du, b, ldb, info )
call cdtsvb( n, nrhs, dl, d, du, b, ldb, info )
call zdtsvb( n, nrhs, dl, d, du, b, ldb, info )
Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
Description
The ?dtsvb routine solves a system of linear equations A*X = B for X, where A is an n-by-n diagonally dominant tridiagonal matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The routine uses the BABE (Burning At Both Ends) algorithm. Note that the equation A**T*X = B may be solved by interchanging the order of the arguments du and dl.
Input Parameters
n INTEGER. The order of A, the number of rows in B; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
dl, d, du, b REAL for sdtsvb DOUBLE PRECISION for ddtsvb COMPLEX for cdtsvb DOUBLE COMPLEX for zdtsvb. Arrays: dl(n - 1), d(n), du(n - 1), b(ldb,*). The array dl contains the (n - 1) subdiagonal elements of A.
The array d contains the diagonal elements of A. The array du contains the (n - 1) superdiagonal elements of A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
Output Parameters
dl Overwritten by the (n-1) elements of the subdiagonal of the lower triangular matrices L1, L2 from the factorization of A.
d Overwritten by the n diagonal element reciprocals of U.
b Overwritten by the solution matrix X.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, u(i,i) is exactly zero, and the solution has not been computed. The factorization has not been completed unless i = n.
Application Notes
A diagonally dominant tridiagonal system is defined such that |d(i)| > |dl(i-1)| + |du(i)| for any i: 1 < i < n, and |d(1)| > |du(1)|, |d(n)| > |dl(n-1)|.
The underlying BABE algorithm is designed for diagonally dominant systems. Such systems have no numerical stability issues, unlike canonical systems, which require elimination with partial pivoting (see ?gtsv). Diagonally dominant systems can therefore be solved much faster than canonical systems.
NOTE
• The current implementation of BABE has a potential accuracy issue on very small or very large data close to the underflow or overflow threshold, respectively. Scale the matrix before applying the solver in the case of such input data.
• Applying the ?dtsvb factorization to non-diagonally dominant systems may lead to a loss of accuracy, or to a false detection of singularity, because no pivoting is performed.
?posv
Computes the solution to the system of linear equations with a symmetric or Hermitian positive-definite matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sposv( uplo, n, nrhs, a, lda, b, ldb, info )
call dposv( uplo, n, nrhs, a, lda, b, ldb, info )
call cposv( uplo, n, nrhs, a, lda, b, ldb, info )
call zposv( uplo, n, nrhs, a, lda, b, ldb, info )
call dsposv( uplo, n, nrhs, a, lda, b, ldb, x, ldx, work, swork, iter, info )
call zcposv( uplo, n, nrhs, a, lda, b, ldb, x, ldx, work, swork, rwork, iter, info )
Fortran 95:
call posv( a, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>posv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* a, lapack_int lda, <datatype>* b, lapack_int ldb );
lapack_int LAPACKE_dsposv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* b, lapack_int ldb, double* x, lapack_int ldx, lapack_int* iter );
lapack_int LAPACKE_zcposv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, lapack_int* iter );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric/Hermitian positive-definite matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The Cholesky decomposition is used to factor A as A = U**T*U (real flavors) and A = U**H*U (complex flavors), if uplo = 'U', or A = L*L**T (real flavors) and A = L*L**H (complex flavors), if uplo = 'L', where U is an upper triangular matrix and L is a lower triangular matrix. The factored form of A is then used to solve the system of equations A*X = B. The dsposv and zcposv are mixed precision iterative refinement subroutines for exploiting fast single precision hardware.
They first attempt to factorize the matrix in single precision (dsposv) or single complex precision (zcposv) and use this factorization within an iterative refinement procedure to produce a solution with double precision (dsposv) / double complex precision (zcposv) normwise backward error quality (see below). If the approach fails, the method switches to a double precision or double complex precision factorization respectively and computes the solution.
The iterative refinement is not going to be a winning strategy if the ratio of single precision (or COMPLEX) performance to double precision (or DOUBLE COMPLEX) performance is too small. A reasonable strategy should take the number of right-hand sides and the size of the matrix into account. This might be done with a call to ilaenv in the future. At present, iterative refinement is always attempted.
The iterative refinement process is stopped if iter > itermax or, for all the right-hand sides, rnrm < sqrt(n)*xnrm*anrm*eps*bwdmax, where
• iter is the number of the current iteration in the iterative refinement process
• rnrm is the infinity-norm of the residual
• xnrm is the infinity-norm of the solution
• anrm is the infinity-operator-norm of the matrix A
• eps is the machine epsilon returned by dlamch('Epsilon').
The values itermax and bwdmax are fixed to 30 and 1.0d+00 respectively.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
a, b REAL for sposv
DOUBLE PRECISION for dposv and dsposv
COMPLEX for cposv
DOUBLE COMPLEX for zposv and zcposv.
Arrays: a(lda,*), b(ldb,*).
The array a contains the upper or the lower triangular part of the matrix A (see uplo). The second dimension of a must be at least max(1, n). Note that in the case of zcposv the imaginary parts of the diagonal elements need not be set and are assumed to be zero.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the array x; ldx ≥ max(1, n).
work DOUBLE PRECISION for dsposv
DOUBLE COMPLEX for zcposv.
Workspace array, DIMENSION (n*nrhs). This array is used to hold the residual vectors.
swork REAL for dsposv
COMPLEX for zcposv.
Workspace array, DIMENSION (n*(n+nrhs)). This array is used to store the single precision matrix and the right-hand sides or solutions in single precision.
rwork DOUBLE PRECISION. Workspace array, DIMENSION (n); used by zcposv only.

Output Parameters
a For ?posv: if info = 0, the upper or lower triangular part of a is overwritten by the Cholesky factor U or L, as specified by uplo.
For dsposv and zcposv: if iterative refinement has been successfully used (info = 0 and iter ≥ 0), then A is unchanged. If double precision factorization has been used (info = 0 and iter < 0), then the array a contains the factor U or L from the Cholesky factorization.
b Overwritten by the solution matrix X.
x DOUBLE PRECISION for dsposv
DOUBLE COMPLEX for zcposv.
Array, DIMENSION (ldx, nrhs). If info = 0, contains the n-by-nrhs solution matrix X.
iter INTEGER.
If iter < 0: iterative refinement has failed, double precision factorization has been performed
• If iter = -1: the routine fell back to full precision for implementation- or machine-specific reasons
• If iter = -2: narrowing the precision induced an overflow, the routine fell back to full precision
• If iter = -3: failure of spotrf for dsposv, or cpotrf for zcposv
• If iter = -31: stop the iterative refinement after the 30th iteration.
If iter > 0: iterative refinement has been successfully used. Returns the number of iterations.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive definite, so the factorization could not be completed, and the solution has not been computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine posv interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.

?posvx
Uses the Cholesky factorization to compute the solution to the system of linear equations with a symmetric or Hermitian positive-definite matrix A, and provides error bounds on the solution.
Syntax
Fortran 77:
call sposvx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dposvx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cposvx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zposvx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call posvx( a, b, x [,uplo] [,af] [,fact] [,equed] [,s] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_sposvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dposvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_cposvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zposvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the Cholesky
factorization A = U**T*U (real flavors) / A = U**H*U (complex flavors) or A = L*L**T (real flavors) / A = L*L**H (complex flavors) to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n real symmetric/Hermitian positive definite matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?posvx performs the following steps:
1. If fact = 'E', real scaling factors s are computed to equilibrate the system:
diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B.
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as
A = U**T*U (real), A = U**H*U (complex), if uplo = 'U',
or A = L*L**T (real), A = L*L**H (complex), if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix.
3. If the leading i-by-i principal minor is not positive-definite, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n + 1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, af contains the factored form of A. If equed = 'Y', the matrix A has been equilibrated with scaling factors given by s. a and af will not be modified.
If fact = 'N', the matrix A will be copied to af and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to af and factored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
a, af, b, work REAL for sposvx
DOUBLE PRECISION for dposvx
COMPLEX for cposvx
DOUBLE COMPLEX for zposvx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the matrix A as specified by uplo. If fact = 'F' and equed = 'Y', then A must have been equilibrated by the scaling factors in s, and a must contain the equilibrated matrix diag(s)*A*diag(s). The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the triangular factor U or L from the Cholesky factorization of A in the same storage format as A. If equed is not 'N', then af is the factored form of the equilibrated matrix diag(s)*A*diag(s). The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,3*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
equed CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
if equed = 'N', no equilibration was done (always true if fact = 'N');
if equed = 'Y', equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
s REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If equed = 'N', s is not accessed.
If fact = 'F' and equed = 'Y', each element of s must be positive.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for cposvx
DOUBLE PRECISION for zposvx.
Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters
x REAL for sposvx
DOUBLE PRECISION for dposvx
COMPLEX for cposvx
DOUBLE COMPLEX for zposvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that if equed = 'Y', A and B are modified on exit, and the solution to the equilibrated system is inv(diag(s))*X. The second dimension of x must be at least max(1,nrhs).
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'.
If fact = 'E' and equed = 'Y', A is overwritten by diag(s)*A*diag(s).
af If fact = 'N' or 'E', then af is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A=U**T*U or A=L*L**T (real routines), A=U**H*U or A=L*L**H (complex routines) of the original matrix A (if fact = 'N'), or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b Overwritten by diag(s)*B, if equed = 'Y'; not changed if equed = 'N'.
s This array is an output argument if fact ≠ 'F'. See the description of s in Input Arguments section.
rcond REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine posvx interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
af Holds the matrix AF of size (n,n).
s Holds the vector of length n. Default value for each element is s(i) = 1.0_WP.
ferr Holds the vector of length (nrhs).
berr Holds the vector of length (nrhs).
uplo Must be 'U' or 'L'. The default value is 'U'.
fact Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then af must be present; otherwise, an error is returned.
equed Must be 'N' or 'Y'. The default value is 'N'.

?posvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a symmetric or Hermitian positive-definite matrix A applying the Cholesky factorization.
Syntax
Fortran 77:
call sposvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call dposvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info )
call cposvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zposvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_sposvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_dposvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
lapack_int LAPACKE_cposvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zposvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the Cholesky factorization A = U**T*U (real flavors) / A = U**H*U (complex flavors) or A = L*L**T (real flavors) / A = L*L**H (complex flavors) to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n real symmetric/Hermitian positive definite matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned.
The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options. Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.
The routine ?posvxx performs the following steps: 1.
If fact = 'E', scaling factors are computed to equilibrate the system:
diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as
A = U**T*U (real), A = U**H*U (complex), if uplo = 'U',
or A = L*L**T (real), A = L*L**H (complex), if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix.
3. If the leading i-by-i principal minor is not positive-definite, the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to get a small error and error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F', on entry, af contains the factored form of A.
If equed is not 'N', the matrix A has been equilibrated with scaling factors given by s. Parameters a and af are not modified.
If fact = 'N', the matrix A will be copied to af and factored.
If fact = 'E', the matrix A will be equilibrated, if necessary, then copied to af and factored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for sposvxx
DOUBLE PRECISION for dposvxx
COMPLEX for cposvxx
DOUBLE COMPLEX for zposvxx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the matrix A as specified by uplo. If fact = 'F' and equed = 'Y', then A must have been equilibrated by the scaling factors in s, and a must contain the equilibrated matrix diag(s)*A*diag(s). The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the triangular factor U or L from the Cholesky factorization of A in the same storage format as A. If equed is not 'N', then af is the factored form of the equilibrated matrix diag(s)*A*diag(s). The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of the array af; ldaf ≥ max(1,n).
equed CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
s REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If equed = 'N', s is not accessed.
If fact = 'F' and equed = 'Y', each element of s must be positive.
Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in the Output Arguments section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.
params REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1): Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the extra-precise refinement algorithm.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2): Maximum number of residual computations allowed for refinement.
Default: 10
Aggressive: Set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3): Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters
x REAL for sposvxx
DOUBLE PRECISION for dposvxx
COMPLEX for cposvxx
DOUBLE COMPLEX for zposvxx.
Array, DIMENSION (ldx,*).
If info = 0, the array x contains the n-by-nrhs solution matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is inv(diag(s))*X.
a Array a is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'.
If fact = 'E' and equed = 'Y', A is overwritten by diag(s)*A*diag(s).
af If fact = 'N' or 'E', then af is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A=U**T*U or A=L*L**T (real routines), A=U**H*U or A=L*L**H (complex routines) of the original matrix A (if fact = 'N'), or of the equilibrated matrix A (if fact = 'E'). See the description of a for the form of the equilibrated matrix.
b If equed = 'N', B is not modified. If equed = 'Y', B is overwritten by diag(s)*B.
s This array is an output argument if fact ≠ 'F'. Each element of this array is a power of the radix. See the description of s in Input Arguments section.
rcond REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
rpvgrw REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A.
berr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector:
max_j |Xtrue(j,i) - X(j,i)| / max_j |X(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('Epsilon') for single precision flavors and sqrt(n)*dlamch('Epsilon') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('Epsilon') for single precision flavors and sqrt(n)*dlamch('Epsilon') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('Epsilon') for single precision flavors and sqrt(n)*dlamch('Epsilon') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds).
For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector:
max_j ( abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)) )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1
"Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is less than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors.
err=2
"Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3
Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for single precision flavors and sqrt(n)*dlamch(e) for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.

equed
If fact ≠
'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Arguments section).

params
If an entry is less than 0.0, that entry is filled with the default value used for that parameter, otherwise the entry is not modified.

info
INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?ppsv
Computes the solution to the system of linear equations with a symmetric (Hermitian) positive definite packed matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call sppsv( uplo, n, nrhs, ap, b, ldb, info )
call dppsv( uplo, n, nrhs, ap, b, ldb, info )
call cppsv( uplo, n, nrhs, ap, b, ldb, info )
call zppsv( uplo, n, nrhs, ap, b, ldb, info )
Fortran 95:
call ppsv( ap, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>ppsv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* ap, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n real symmetric/Hermitian positive-definite matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The Cholesky decomposition is used to factor A as
A = U^T*U (real flavors) and A = U^H*U (complex flavors), if uplo = 'U'
or A = L*L^T (real flavors) and A = L*L^H (complex flavors), if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix. The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

nrhs
INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

ap, b
REAL for sppsv
DOUBLE PRECISION for dppsv
COMPLEX for cppsv
DOUBLE COMPLEX for zppsv.
Arrays: ap(*), b(ldb,*).
The array ap contains the upper or the lower triangular part of the matrix A (as specified by uplo) in packed storage (see Matrix Storage Schemes). The dimension of ap must be at least max(1, n*(n+1)/2).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters

ap
If info = 0, the upper or lower triangular part of A in packed storage is overwritten by the Cholesky factor U or L, as specified by uplo.

b
Overwritten by the solution matrix X.

info
INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution has not been computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ppsv interface are as follows:
ap    Holds the array A of size (n*(n+1)/2).
b     Holds the matrix B of size (n,nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

?ppsvx
Uses the Cholesky factorization to compute the solution to the system of linear equations with a symmetric (Hermitian) positive definite packed matrix A, and provides error bounds on the solution.
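Both ?ppsv and ?ppsvx take A in packed storage through the array ap. As a rough illustration of what that layout means, the sketch below maps a 0-based element position (i, j) to its position in ap, assuming the column-major (Fortran-style) packing in which the stored triangle is laid out column by column. The helper names packed_index_upper, packed_index_lower, and pack_upper are hypothetical and are not part of Intel MKL; see Matrix Storage Schemes for the authoritative definition.

```c
#include <assert.h>
#include <stddef.h>

/* 0-based position in ap of element (i, j), i <= j, when the upper
   triangle is packed column by column (the uplo = 'U' case).
   Assumption: column-major packing. */
size_t packed_index_upper(size_t i, size_t j)
{
    assert(i <= j);
    return i + j * (j + 1) / 2;
}

/* Same for the lower triangle (the uplo = 'L' case), j <= i < n. */
size_t packed_index_lower(size_t i, size_t j, size_t n)
{
    assert(j <= i && i < n);
    return i + j * (2 * n - j - 1) / 2;
}

/* Hypothetical helper: copy the upper triangle of a dense column-major
   n-by-n matrix a (leading dimension n) into ap, which must hold at
   least n*(n+1)/2 elements. */
void pack_upper(size_t n, const double *a, double *ap)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i <= j; ++i)
            ap[packed_index_upper(i, j)] = a[i + j * n];
}
```

An array filled this way has exactly n*(n+1)/2 meaningful elements, which is where the lower bound on the dimension of ap comes from.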
Syntax
Fortran 77:
call sppsvx( fact, uplo, n, nrhs, ap, afp, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dppsvx( fact, uplo, n, nrhs, ap, afp, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cppsvx( fact, uplo, n, nrhs, ap, afp, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zppsvx( fact, uplo, n, nrhs, ap, afp, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call ppsvx( ap, b, x [,uplo] [,af] [,fact] [,equed] [,s] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_sppsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, float* ap, float* afp, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dppsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, double* ap, double* afp, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_cppsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* ap, lapack_complex_float* afp, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zppsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* ap, lapack_complex_double* afp, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the Cholesky factorization A=U^T*U (real flavors) / A=U^H*U (complex flavors) or A=L*L^T (real flavors) / A=L*L^H (complex flavors) to compute the solution to a real or complex
system of linear equations A*X = B, where A is an n-by-n symmetric or Hermitian positive-definite matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?ppsvx performs the following steps:
1. If fact = 'E', real scaling factors s are computed to equilibrate the system:
diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B.
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as
A = U^T*U (real), A = U^H*U (complex), if uplo = 'U',
or A = L*L^T (real), A = L*L^H (complex), if uplo = 'L',
where U is an upper triangular matrix and L is a lower triangular matrix.
3. If the leading i-by-i principal minor is not positive-definite, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

fact
CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, afp contains the factored form of A. If equed = 'Y', the matrix A has been equilibrated with scaling factors given by s. ap and afp will not be modified.
If fact = 'N', the matrix A will be copied to afp and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to afp and factored.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

nrhs
INTEGER. The number of right-hand sides; the number of columns in B; nrhs ≥ 0.

ap, afp, b, work
REAL for sppsvx
DOUBLE PRECISION for dppsvx
COMPLEX for cppsvx
DOUBLE COMPLEX for zppsvx.
Arrays: ap(*), afp(*), b(ldb,*), work(*).
The array ap contains the upper or lower triangle of the original symmetric/Hermitian matrix A in packed storage (see Matrix Storage Schemes). If fact = 'F' and equed = 'Y', ap must contain the equilibrated matrix diag(s)*A*diag(s).
The array afp is an input argument if fact = 'F' and contains the triangular factor U or L from the Cholesky factorization of A in the same storage format as A. If equed is not 'N', then afp is the factored form of the equilibrated matrix A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1, n*(n+1)/2); the second dimension of b must be at least max(1,nrhs); the dimension of work must be at least max(1, 3*n) for real flavors and max(1, 2*n) for complex flavors.

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

equed
CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
if equed = 'N', no equilibration was done (always true if fact = 'N');
if equed = 'Y', equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

s
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If equed = 'N', s is not accessed.
If fact = 'F' and equed = 'Y', each element of s must be positive.

ldx
INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

iwork
INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.

rwork
REAL for cppsvx; DOUBLE PRECISION for zppsvx.
Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters

x
REAL for sppsvx
DOUBLE PRECISION for dppsvx
COMPLEX for cppsvx
DOUBLE COMPLEX for zppsvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that if equed = 'Y', A and B are modified on exit, and the solution to the equilibrated system is inv(diag(s))*X. The second dimension of x must be at least max(1,nrhs).

ap
Array ap is not modified on exit if fact = 'F' or 'N', or if fact = 'E' and equed = 'N'.
If fact = 'E' and equed = 'Y', A is overwritten by diag(s)*A*diag(s).
afp
If fact = 'N' or 'E', then afp is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A=U^T*U or A=L*L^T (real routines), A=U^H*U or A=L*L^H (complex routines) of the original matrix A (if fact = 'N'), or of the equilibrated matrix A (if fact = 'E'). See the description of ap for the form of the equilibrated matrix.

b
Overwritten by diag(s)*B, if equed = 'Y'; not changed if equed = 'N'.

s
This array is an output argument if fact ≠ 'F'. See the description of s in the Input Arguments section.

rcond
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.

ferr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.

berr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

equed
If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Arguments section).

info
INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ppsvx interface are as follows:
ap     Holds the array A of size (n*(n+1)/2).
b      Holds the matrix B of size (n,nrhs).
x      Holds the matrix X of size (n,nrhs).
afp    Holds the matrix AF of size (n*(n+1)/2).
s      Holds the vector of length n. Default value for each element is s(i) = 1.0_WP.
ferr   Holds the vector of length (nrhs).
berr   Holds the vector of length (nrhs).
uplo   Must be 'U' or 'L'. The default value is 'U'.
fact   Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then af must be present; otherwise, an error is returned.
equed  Must be 'N' or 'Y'. The default value is 'N'.

?pbsv
Computes the solution to the system of linear equations with a symmetric or Hermitian positive-definite band matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call spbsv( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call dpbsv( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call cpbsv( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
call zpbsv( uplo, n, kd, nrhs, ab, ldab, b, ldb, info )
Fortran 95:
call pbsv( ab, b [,uplo] [,info] )
C:
lapack_int LAPACKE_<?>pbsv( int matrix_order, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, <datatype>* ab, lapack_int ldab, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric/Hermitian positive definite band matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The Cholesky decomposition is used to factor A as
A = U^T*U (real flavors) and A = U^H*U (complex flavors), if uplo = 'U'
or A = L*L^T (real flavors) and A = L*L^H (complex flavors), if uplo = 'L',
where U is an upper triangular band matrix and L is a lower triangular band matrix, with the same number of superdiagonals or subdiagonals as A. The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

kd
INTEGER. The number of superdiagonals of the matrix A if uplo = 'U', or the number of subdiagonals if uplo = 'L'; kd ≥ 0.

nrhs
INTEGER.
The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

ab, b
REAL for spbsv
DOUBLE PRECISION for dpbsv
COMPLEX for cpbsv
DOUBLE COMPLEX for zpbsv.
Arrays: ab(ldab, *), b(ldb,*).
The array ab contains the upper or the lower triangular part of the matrix A (as specified by uplo) in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).

ldab
INTEGER. The leading dimension of the array ab; ldab ≥ kd + 1.

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters

ab
The upper or lower triangular part of A (in band storage) is overwritten by the Cholesky factor U or L, as specified by uplo, in the same storage format as A.

b
Overwritten by the solution matrix X.

info
INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution has not been computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbsv interface are as follows:
ab    Holds the array A of size (kd+1,n).
b     Holds the matrix B of size (n,nrhs).
uplo  Must be 'U' or 'L'. The default value is 'U'.

?pbsvx
Uses the Cholesky factorization to compute the solution to the system of linear equations with a symmetric (Hermitian) positive-definite band matrix A, and provides error bounds on the solution.
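Both ?pbsv and ?pbsvx take A in band storage through the array ab with leading dimension ldab ≥ kd + 1. The sketch below shows one way to fill ab for the uplo = 'U' case, assuming the usual column-major LAPACK band layout in which column j of A is stored in column j of ab and the diagonal sits in row kd (0-based). The function names band_index_upper and band_pack_upper are hypothetical and are not part of Intel MKL; see Matrix Storage Schemes for the authoritative definition.

```c
#include <assert.h>
#include <stddef.h>

/* 0-based position in ab of element (i, j) of A for uplo = 'U'.
   Valid only inside the band: j - kd <= i <= j.
   Assumption: column-major band layout, diagonal in row kd of ab. */
size_t band_index_upper(size_t i, size_t j, size_t kd, size_t ldab)
{
    assert(i <= j && j - i <= kd && ldab >= kd + 1);
    return (kd + i - j) + j * ldab;
}

/* Hypothetical helper: copy the upper band of a dense column-major
   n-by-n matrix a (leading dimension n) into ab. Entries of ab that do
   not correspond to an element of A are left untouched. */
void band_pack_upper(size_t n, size_t kd, const double *a,
                     double *ab, size_t ldab)
{
    for (size_t j = 0; j < n; ++j) {
        size_t first = (j > kd) ? j - kd : 0;  /* first row inside the band */
        for (size_t i = first; i <= j; ++i)
            ab[band_index_upper(i, j, kd, ldab)] = a[i + j * n];
    }
}
```

With kd = 1 the array ab has two rows: the superdiagonal in the first row (its first entry unused) and the diagonal in the second, which matches the (kd+1,n) size quoted in the Fortran 95 notes.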
Syntax
Fortran 77:
call spbsvx( fact, uplo, n, kd, nrhs, ab, ldab, afb, ldafb, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call dpbsvx( fact, uplo, n, kd, nrhs, ab, ldab, afb, ldafb, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info )
call cpbsvx( fact, uplo, n, kd, nrhs, ab, ldab, afb, ldafb, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zpbsvx( fact, uplo, n, kd, nrhs, ab, ldab, afb, ldafb, equed, s, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
Fortran 95:
call pbsvx( ab, b, x [,uplo] [,afb] [,fact] [,equed] [,s] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_spbsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, float* ab, lapack_int ldab, float* afb, lapack_int ldafb, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_dpbsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, double* ab, lapack_int ldab, double* afb, lapack_int ldafb, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
lapack_int LAPACKE_cpbsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, lapack_complex_float* ab, lapack_int ldab, lapack_complex_float* afb, lapack_int ldafb, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zpbsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int kd, lapack_int nrhs, lapack_complex_double* ab, lapack_int ldab, lapack_complex_double* afb, lapack_int ldafb, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and
mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the Cholesky factorization A=U^T*U (real flavors) / A=U^H*U (complex flavors) or A=L*L^T (real flavors) / A=L*L^H (complex flavors) to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric or Hermitian positive definite band matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?pbsvx performs the following steps:
1. If fact = 'E', real scaling factors s are computed to equilibrate the system:
diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B.
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the Cholesky decomposition is used to factor the matrix A (after equilibration if fact = 'E') as
A = U^T*U (real), A = U^H*U (complex), if uplo = 'U',
or A = L*L^T (real), A = L*L^H (complex), if uplo = 'L',
where U is an upper triangular band matrix and L is a lower triangular band matrix.
3. If the leading i-by-i principal minor is not positive definite, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
4. The system of equations is solved for X using the factored form of A.
5. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

fact
CHARACTER*1. Must be 'F', 'N', or 'E'.
Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F': on entry, afb contains the factored form of A. If equed = 'Y', the matrix A has been equilibrated with scaling factors given by s. ab and afb will not be modified.
If fact = 'N', the matrix A will be copied to afb and factored.
If fact = 'E', the matrix A will be equilibrated if necessary, then copied to afb and factored.

uplo
CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.

n
INTEGER. The order of matrix A; n ≥ 0.

kd
INTEGER. The number of superdiagonals or subdiagonals in the matrix A; kd ≥ 0.

nrhs
INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.

ab, afb, b, work
REAL for spbsvx
DOUBLE PRECISION for dpbsvx
COMPLEX for cpbsvx
DOUBLE COMPLEX for zpbsvx.
Arrays: ab(ldab,*), afb(ldafb,*), b(ldb,*), work(*).
The array ab contains the upper or lower triangle of the matrix A in band storage (see Matrix Storage Schemes). If fact = 'F' and equed = 'Y', then ab must contain the equilibrated matrix diag(s)*A*diag(s). The second dimension of ab must be at least max(1, n).
The array afb is an input argument if fact = 'F'. It contains the triangular factor U or L from the Cholesky factorization of the band matrix A in the same storage format as A. If equed = 'Y', then afb is the factored form of the equilibrated matrix A.
The second dimension of afb must be at least max(1, n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,3*n) for real flavors, and at least max(1,2*n) for complex flavors.

ldab
INTEGER. The leading dimension of ab; ldab ≥ kd+1.

ldafb
INTEGER. The leading dimension of afb; ldafb ≥ kd+1.

ldb
INTEGER. The leading dimension of b; ldb ≥ max(1, n).

equed
CHARACTER*1. Must be 'N' or 'Y'.
equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
if equed = 'N', no equilibration was done (always true if fact = 'N');
if equed = 'Y', equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).

s
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. This array is an input argument if fact = 'F' only; otherwise it is an output argument.
If equed = 'N', s is not accessed.
If fact = 'F' and equed = 'Y', each element of s must be positive.

ldx
INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

iwork
INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.

rwork
REAL for cpbsvx
DOUBLE PRECISION for zpbsvx.
Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters

x
REAL for spbsvx
DOUBLE PRECISION for dpbsvx
COMPLEX for cpbsvx
DOUBLE COMPLEX for zpbsvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the original system of equations. Note that if equed = 'Y', A and B are modified on exit, and the solution to the equilibrated system is inv(diag(s))*X. The second dimension of x must be at least max(1,nrhs).
ab
On exit, if fact = 'E' and equed = 'Y', A is overwritten by diag(s)*A*diag(s).

afb
If fact = 'N' or 'E', then afb is an output argument and on exit returns the triangular factor U or L from the Cholesky factorization A=U^T*U or A=L*L^T (real routines), A=U^H*U or A=L*L^H (complex routines) of the original matrix A (if fact = 'N'), or of the equilibrated matrix A (if fact = 'E'). See the description of ab for the form of the equilibrated matrix.

b
Overwritten by diag(s)*B, if equed = 'Y'; not changed if equed = 'N'.

s
This array is an output argument if fact ≠ 'F'. See the description of s in the Input Arguments section.

rcond
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
An estimate of the reciprocal condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.

ferr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.

berr
REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

equed
If fact ≠ 'F', then equed is an output argument.
It specifies the form of equilibration that was done (see the description of equed in the Input Arguments section).

info
INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, the leading minor of order i (and therefore the matrix A itself) is not positive definite, so the factorization could not be completed, and the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine pbsvx interface are as follows:
ab     Holds the array A of size (kd+1,n).
b      Holds the matrix B of size (n,nrhs).
x      Holds the matrix X of size (n,nrhs).
afb    Holds the array AF of size (kd+1,n).
s      Holds the vector with the number of elements n. Default value for each element is s(i) = 1.0_WP.
ferr   Holds the vector with the number of elements nrhs.
berr   Holds the vector with the number of elements nrhs.
uplo   Must be 'U' or 'L'. The default value is 'U'.
fact   Must be 'N', 'E', or 'F'. The default value is 'N'. If fact = 'F', then afb must be present; otherwise, an error is returned.
equed  Must be 'N' or 'Y'. The default value is 'N'.

?ptsv
Computes the solution to the system of linear equations with a symmetric or Hermitian positive definite tridiagonal matrix A and multiple right-hand sides.
Syntax Fortran 77: call sptsv( n, nrhs, d, e, b, ldb, info ) call dptsv( n, nrhs, d, e, b, ldb, info ) call cptsv( n, nrhs, d, e, b, ldb, info ) call zptsv( n, nrhs, d, e, b, ldb, info ) Fortran 95: call ptsv( d, e, b [,info] ) C: lapack_int LAPACKE_sptsv( int matrix_order, lapack_int n, lapack_int nrhs, float* d, float* e, float* b, lapack_int ldb ); lapack_int LAPACKE_dptsv( int matrix_order, lapack_int n, lapack_int nrhs, double* d, double* e, double* b, lapack_int ldb ); lapack_int LAPACKE_cptsv( int matrix_order, lapack_int n, lapack_int nrhs, float* d, lapack_complex_float* e, lapack_complex_float* b, lapack_int ldb ); lapack_int LAPACKE_zptsv( int matrix_order, lapack_int n, lapack_int nrhs, double* d, lapack_complex_double* e, lapack_complex_double* b, lapack_int ldb ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric/Hermitian positive-definite tridiagonal matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. A is factored as A = L*D*L**T (real flavors) or A = L*D*L**H (complex flavors), and the factored form of A is then used to solve the system of equations A*X = B. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. n INTEGER. The order of matrix A; n ≥ 0. nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0. d REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, dimension at least max(1, n). Contains the diagonal elements of the tridiagonal matrix A.
e, b REAL for sptsv DOUBLE PRECISION for dptsv COMPLEX for cptsv DOUBLE COMPLEX for zptsv. Arrays: e(n - 1), b(ldb,*). The array e contains the (n - 1) subdiagonal elements of A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs). ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n). Output Parameters d Overwritten by the n diagonal elements of the diagonal matrix D from the L*D*L**T (real)/ L*D*L**H (complex) factorization of A. e Overwritten by the (n - 1) subdiagonal elements of the unit bidiagonal factor L from the factorization of A. b Overwritten by the solution matrix X. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, and the solution has not been computed. The factorization has not been completed unless i = n. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ptsv interface are as follows: d Holds the vector of length n. e Holds the vector of length (n-1). b Holds the matrix B of size (n,nrhs). ?ptsvx Uses factorization to compute the solution to the system of linear equations with a symmetric (Hermitian) positive definite tridiagonal matrix A, and provides error bounds on the solution.
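The L*D*L**T approach that ?ptsv and ?ptsvx are described as using can be illustrated in plain C. The following is a minimal sketch for the real double-precision case with a single right-hand side, not the MKL implementation: the name ptsv_sketch, its argument list, and its return convention (modeled on info) are assumptions made for illustration only. As with ?ptsv, d and e are overwritten by the factors and b by the solution.

```c
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): solves A*x = b for one
 * right-hand side, where A is an n-by-n symmetric positive definite
 * tridiagonal matrix with diagonal d[0..n-1] and subdiagonal e[0..n-2].
 *
 * On success returns 0; d holds the diagonal of D, e holds the
 * subdiagonal of the unit bidiagonal factor L, and b holds x.
 * Returns i > 0 if the leading minor of order i is not positive
 * definite (b is then left unsolved). */
int ptsv_sketch(size_t n, double *d, double *e, double *b)
{
    if (n == 0) {
        return 0;
    }

    /* Factor A = L*D*L**T in place. */
    for (size_t i = 0; i + 1 < n; ++i) {
        if (d[i] <= 0.0) {
            return (int)(i + 1);
        }
        double ei = e[i];
        e[i] = ei / d[i];          /* multiplier stored in L */
        d[i + 1] -= e[i] * ei;     /* update the next pivot  */
    }
    if (d[n - 1] <= 0.0) {
        return (int)n;
    }

    /* Forward substitution: solve L*y = b. */
    for (size_t i = 1; i < n; ++i) {
        b[i] -= e[i - 1] * b[i - 1];
    }

    /* Back substitution: solve D*L**T*x = y. */
    b[n - 1] /= d[n - 1];
    for (size_t i = n - 1; i-- > 0; ) {
        b[i] = b[i] / d[i] - e[i] * b[i + 1];
    }
    return 0;
}
```

Calling the helper with d = {2, 2, 2}, e = {-1, -1}, and b = {1, 0, 1} leaves the solution of the 3-by-3 system in b. In production code, the LAPACKE_dptsv interface should be preferred over a hand-written loop.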
Syntax Fortran 77: call sptsvx( fact, n, nrhs, d, e, df, ef, b, ldb, x, ldx, rcond, ferr, berr, work, info ) call dptsvx( fact, n, nrhs, d, e, df, ef, b, ldb, x, ldx, rcond, ferr, berr, work, info ) call cptsvx( fact, n, nrhs, d, e, df, ef, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info ) call zptsvx( fact, n, nrhs, d, e, df, ef, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info ) Fortran 95: call ptsvx( d, e, b, x [,df] [,ef] [,fact] [,ferr] [,berr] [,rcond] [,info] ) C: lapack_int LAPACKE_sptsvx( int matrix_order, char fact, lapack_int n, lapack_int nrhs, const float* d, const float* e, float* df, float* ef, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_dptsvx( int matrix_order, char fact, lapack_int n, lapack_int nrhs, const double* d, const double* e, double* df, double* ef, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr ); lapack_int LAPACKE_cptsvx( int matrix_order, char fact, lapack_int n, lapack_int nrhs, const float* d, const lapack_complex_float* e, float* df, lapack_complex_float* ef, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_zptsvx( int matrix_order, char fact, lapack_int n, lapack_int nrhs, const double* d, const lapack_complex_double* e, double* df, lapack_complex_double* ef, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine uses the Cholesky factorization A = L*D*L**T (real)/A = L*D*L**H (complex) to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric or Hermitian positive definite tridiagonal matrix, the columns of
matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Error bounds on the solution and a condition estimate are also provided. The routine ?ptsvx performs the following steps: 1. If fact = 'N', the matrix A is factored as A = L*D*L**T (real flavors)/A = L*D*L**H (complex flavors), where L is a unit lower bidiagonal matrix and D is diagonal. The factorization can also be regarded as having the form A = U**T*D*U (real flavors)/A = U**H*D*U (complex flavors). 2. If the leading i-by-i principal minor is not positive-definite, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below. 3. The system of equations is solved for X using the factored form of A. 4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. fact CHARACTER*1. Must be 'F' or 'N'. Specifies whether or not the factored form of the matrix A is supplied on entry. If fact = 'F': on entry, df and ef contain the factored form of A. Arrays d, e, df, and ef will not be modified. If fact = 'N', the matrix A will be copied to df and ef, and factored. n INTEGER. The order of matrix A; n ≥ 0. nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0. d, df, rwork REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Arrays: d(n), df(n), rwork(n).
The array d contains the n diagonal elements of the tridiagonal matrix A. The array df is an input argument if fact = 'F' and on entry contains the n diagonal elements of the diagonal matrix D from the L*D*L**T (real)/ L*D*L**H (complex) factorization of A. The array rwork is a workspace array used for complex flavors only. e, ef, b, work REAL for sptsvx DOUBLE PRECISION for dptsvx COMPLEX for cptsvx DOUBLE COMPLEX for zptsvx. Arrays: e(n - 1), ef(n - 1), b(ldb,*), work(*). The array e contains the (n - 1) subdiagonal elements of the tridiagonal matrix A. The array ef is an input argument if fact = 'F' and on entry contains the (n - 1) subdiagonal elements of the unit bidiagonal factor L from the L*D*L**T (real)/ L*D*L**H (complex) factorization of A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The array work is a workspace array. The dimension of work must be at least 2*n for real flavors, and at least n for complex flavors. ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n). ldx INTEGER. The leading dimension of x; ldx ≥ max(1, n). Output Parameters x REAL for sptsvx DOUBLE PRECISION for dptsvx COMPLEX for cptsvx DOUBLE COMPLEX for zptsvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs). df, ef These arrays are output arguments if fact = 'N'. See the description of df, ef in Input Arguments section. rcond REAL for single precision flavors DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0. ferr REAL for single precision flavors DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error. berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and i ≤ n, the leading minor of order i (and therefore the matrix A itself) is not positive-definite, so the factorization could not be completed, and the solution and error bounds could not be computed; rcond = 0 is returned. If info = i, and i = n + 1, then U is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine ptsvx interface are as follows: d Holds the vector of length n. e Holds the vector of length (n-1). b Holds the matrix B of size (n,nrhs). x Holds the matrix X of size (n,nrhs). df Holds the vector of length n.
ef Holds the vector of length (n-1). ferr Holds the vector of length (nrhs). berr Holds the vector of length (nrhs). fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments df and ef must be present; otherwise, an error is returned. ?sysv Computes the solution to the system of linear equations with a real or complex symmetric matrix A and multiple right-hand sides. Syntax Fortran 77: call ssysv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info ) call dsysv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info ) call csysv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info ) call zsysv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info ) Fortran 95: call sysv( a, b [,uplo] [,ipiv] [,info] ) C: lapack_int LAPACKE_<?>sysv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* a, lapack_int lda, lapack_int* ipiv, <datatype>* b, lapack_int ldb ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. The diagonal pivoting method is used to factor A as A = U*D*U**T or A = L*D*L**T, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks. The factored form of A is then used to solve the system of equations A*X = B. Input Parameters The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. uplo CHARACTER*1. Must be 'U' or 'L'.
Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored. n INTEGER. The order of matrix A; n ≥ 0. nrhs INTEGER. The number of right-hand sides; the number of columns in B; nrhs ≥ 0. a, b, work REAL for ssysv DOUBLE PRECISION for dsysv COMPLEX for csysv DOUBLE COMPLEX for zsysv. Arrays: a(lda,*), b(ldb,*), work(*). The array a contains the upper or the lower triangular part of the symmetric matrix A (see uplo). The second dimension of a must be at least max(1, n). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs). work is a workspace array, dimension at least max(1,lwork). lda INTEGER. The leading dimension of a; lda ≥ max(1, n). ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n). lwork INTEGER. The size of the work array; lwork ≥ 1. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details and for the suggested value of lwork. Output Parameters a If info = 0, a is overwritten by the block-diagonal matrix D and the multipliers used to obtain the factor U (or L) from the factorization of A as computed by ?sytrf. b If info = 0, b is overwritten by the solution matrix X. ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D, as determined by ?sytrf. If ipiv(i) = k > 0, then D(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and (i-1)-th row and column of A was interchanged with the m-th row and column. If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and (i+1)-th row and column of A was interchanged with the m-th row and column. work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, D(i,i) is 0. The factorization has been completed, but D is exactly singular, so the solution could not be computed. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine sysv interface are as follows: a Holds the matrix A of size (n,n). b Holds the matrix B of size (n,nrhs). ipiv Holds the vector of length n. uplo Must be 'U' or 'L'. The default value is 'U'. Application Notes For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm. If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1. If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace. ?sysvx Uses the diagonal pivoting factorization to compute the solution to the system of linear equations with a real or complex symmetric matrix A, and provides error bounds on the solution. Syntax Fortran 77: call ssysvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, iwork, info ) call dsysvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, iwork, info ) call csysvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, rwork, info ) call zsysvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, rwork, info ) Fortran 95: call sysvx( a, b, x [,uplo] [,af] [,ipiv] [,fact] [,ferr] [,berr] [,rcond] [,info] ) C: lapack_int LAPACKE_ssysvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const float* a, lapack_int lda, float* af, lapack_int ldaf, lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_dsysvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const double* a, lapack_int lda, double* af, lapack_int ldaf, lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr ); lapack_int LAPACKE_csysvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, const
lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_zsysvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine uses the diagonal pivoting factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Error bounds on the solution and a condition estimate are also provided. The routine ?sysvx performs the following steps: 1. If fact = 'N', the diagonal pivoting method is used to factor the matrix A. The form of the factorization is A = U*D*U**T or A = L*D*L**T, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks. 2. If some D(i,i) = 0, so that D is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below. 3. The system of equations is solved for X using the factored form of A. 4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it. Input Parameters The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions. fact CHARACTER*1. Must be 'F' or 'N'. Specifies whether or not the factored form of the matrix A has been supplied on entry. If fact = 'F': on entry, af and ipiv contain the factored form of A. Arrays a, af, and ipiv will not be modified. If fact = 'N', the matrix A will be copied to af and factored. uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored. n INTEGER. The order of matrix A; n ≥ 0. nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0. a, af, b, work REAL for ssysvx DOUBLE PRECISION for dsysvx COMPLEX for csysvx DOUBLE COMPLEX for zsysvx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*). The array a contains the upper or the lower triangular part of the symmetric matrix A (see uplo). The second dimension of a must be at least max(1,n). The array af is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T as computed by ?sytrf. The second dimension of af must be at least max(1,n). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs). work(*) is a workspace array, dimension at least max(1,lwork). lda INTEGER. The leading dimension of a; lda ≥ max(1, n). ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n). ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n). ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'.
It contains details of the interchanges and the block structure of D, as determined by ?sytrf. If ipiv(i) = k > 0, then D(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column. If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and (i-1)-th row and column of A was interchanged with the m-th row and column. If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and (i+1)-th row and column of A was interchanged with the m-th row and column. ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n). lwork INTEGER. The size of the work array. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details and for the suggested value of lwork. iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only. rwork REAL for csysvx; DOUBLE PRECISION for zsysvx. Workspace array, DIMENSION at least max(1, n); used in complex flavors only. Output Parameters x REAL for ssysvx DOUBLE PRECISION for dsysvx COMPLEX for csysvx DOUBLE COMPLEX for zsysvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs). af, ipiv These arrays are output arguments if fact = 'N'. See the description of af, ipiv in Input Arguments section. rcond REAL for single precision flavors DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision.
This condition is indicated by a return code of info > 0. ferr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error. berr REAL for single precision flavors DOUBLE PRECISION for double precision flavors. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution. work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs. info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and i ≤ n, then D(i,i) is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = i, and i = n + 1, then D is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sysvx interface are as follows: a Holds the matrix A of size (n,n). b Holds the matrix B of size (n,nrhs). x Holds the matrix X of size (n,nrhs). af Holds the matrix AF of size (n,n). ipiv Holds the vector of length n. ferr Holds the vector of length (nrhs). berr Holds the vector of length (nrhs). uplo Must be 'U' or 'L'. The default value is 'U'. fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments af and ipiv must be present; otherwise, an error is returned. Application Notes The value of lwork must be at least max(1,m*n), where for real flavors m = 3 and for complex flavors m = 2. For better performance, try using lwork = max(1, m*n, n*blocksize), where blocksize is the optimal block size for ?sytrf. If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1. If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query. Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace. ?sysvxx Uses extra precise iterative refinement to compute the solution to the system of linear equations with a symmetric indefinite matrix A applying the diagonal pivoting factorization.
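The berr output of the expert drivers (?sysvx, ?sysvxx, and the others in this section) is described as the componentwise relative backward error. For one solution vector x of A*x = b, this quantity is commonly computed as the maximum over i of |b - A*x|(i) / (|A|*|x| + |b|)(i). The sketch below evaluates that formula for a small dense system in plain C. It is an illustration under stated assumptions, not the MKL code: the name backward_error_sketch and the row-major storage are hypothetical choices, and rows with a zero denominator are simply skipped, whereas library implementations apply more careful safeguards for tiny denominators.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): componentwise relative
 * backward error of a computed solution x of A*x = b.
 *
 * a is an n-by-n dense matrix stored row-major (an assumption of this
 * sketch), x and b are vectors of length n.
 * Returns max_i |b - A*x|(i) / (|A|*|x| + |b|)(i), skipping rows whose
 * denominator is exactly zero. */
double backward_error_sketch(size_t n, const double *a,
                             const double *x, const double *b)
{
    double berr = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = b[i];            /* residual component (b - A*x)(i) */
        double denom = fabs(b[i]);  /* (|A|*|x| + |b|)(i)              */
        for (size_t j = 0; j < n; ++j) {
            r -= a[i * n + j] * x[j];
            denom += fabs(a[i * n + j]) * fabs(x[j]);
        }
        if (denom > 0.0) {
            double ratio = fabs(r) / denom;
            if (ratio > berr) {
                berr = ratio;
            }
        }
    }
    return berr;
}
```

An exact solution gives a value of zero. A value near the machine precision indicates that x solves a system whose entries differ from those of A and b only by rounding-level relative perturbations.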
Syntax Fortran 77: call ssysvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info ) call dsysvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, iwork, info ) call csysvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info ) call zsysvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info ) C: lapack_int LAPACKE_ssysvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, float* a, lapack_int lda, float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* s, float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params ); lapack_int LAPACKE_dsysvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, double* a, lapack_int lda, double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* s, double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params ); lapack_int LAPACKE_csysvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float*
err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params ); lapack_int LAPACKE_zsysvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params ); Include Files • Fortran: mkl_lapack.fi and mkl_lapack.h • Fortran 95: lapack.f90 • C: mkl_lapacke.h Description The routine uses the diagonal pivoting factorization A = U*D*U**T or A = L*D*L**T to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions. Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned. The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options. Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce. The routine ?sysvxx performs the following steps: 1.
If fact = 'E', scaling factors are computed to equilibrate the system:
diag(s)*A*diag(s) * inv(diag(s))*X = diag(s)*B
Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the diagonal pivoting method is used to factor the matrix A (after equilibration if fact = 'E') as A = U*D*U**T, if uplo = 'U', or A = L*D*L**T, if uplo = 'L', where U or L is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
3. If some D(i,i)=0, so that D is exactly singular, the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to get a small error and error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored.
If fact = 'F', on entry, af and ipiv contain the factored form of A.
If equed is not 'N', the matrix A has been equilibrated with scaling factors given by s. Parameters a, af, and ipiv are not modified.
If fact = 'N', the matrix A will be copied to af and factored.
If fact = 'E', the matrix A will be equilibrated, if necessary, copied to af and factored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored:
If uplo = 'U', the upper triangle of A is stored.
If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work REAL for ssysvxx
DOUBLE PRECISION for dsysvxx
COMPLEX for csysvxx
DOUBLE COMPLEX for zsysvxx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the symmetric matrix A as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T as computed by ?sytrf. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,4*n) for real flavors, and at least max(1,2*n) for complex flavors.
lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).
ldaf INTEGER.
The leading dimension of the array af; ldaf ≥ max(1,n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D as determined by ?sytrf.
If ipiv(k) > 0, rows and columns k and ipiv(k) were interchanged and D(k,k) is a 1-by-1 diagonal block.
If uplo = 'U' and ipiv(k) = ipiv(k-1) < 0, rows and columns k-1 and -ipiv(k) were interchanged and D(k-1:k, k-1:k) is a 2-by-2 diagonal block.
If uplo = 'L' and ipiv(k) = ipiv(k+1) < 0, rows and columns k+1 and -ipiv(k) were interchanged and D(k:k+1, k:k+1) is a 2-by-2 diagonal block.
equed CHARACTER*1. Must be 'N' or 'Y'. equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done:
If equed = 'N', no equilibration was done (always true if fact = 'N').
If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
s REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'Y', A is multiplied on the left and right by diag(s). This array is an input argument if fact = 'F' only; otherwise it is an output argument. If fact = 'F' and equed = 'Y', each element of s must be positive.
Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right hand side and each type (normwise or componentwise).
See err_bnds_norm and err_bnds_comp descriptions in the Output Arguments section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams ≤ 0, the params array is never referenced and default values are used.
params REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1): Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the extra-precise refinement algorithm.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2): Maximum number of residual computations allowed for refinement.
Default: 10.
Aggressive: Set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3): Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.
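The default-substitution rule for params described above can be sketched in C as follows. This is a minimal illustration that does not call the library: the enum constants, the defaults table, and both helper functions are hypothetical names introduced here, and the indices are shifted to 0-based C positions (params(1) in the text is params[0] in the sketch).

```c
#include <assert.h>
#include <stddef.h>

/* 0-based C positions of the three documented entries, that is,
   params(1), params(2), params(3) in the numbering used by the text.
   The constant names are illustrative, not library symbols. */
enum {
    LA_LINRX_ITREF_I   = 0,  /* iterative refinement on/off           */
    LA_LINRX_ITHRESH_I = 1,  /* maximum number of residual computations */
    LA_LINRX_CWISE_I   = 2,  /* attempt componentwise convergence      */
    LA_LINRX_NPARAMS   = 3
};

/* Defaults as documented above, in the same order as the enum. */
static const double la_linrx_defaults[LA_LINRX_NPARAMS] = { 1.0, 10.0, 1.0 };

/* Hypothetical helper: the value the documentation says is used for
   entry idx. A negative entry, or an entry at or beyond nparams,
   means "use the default". */
static double effective_param(const double *params, int nparams, int idx)
{
    assert(idx >= 0 && idx < LA_LINRX_NPARAMS);
    if (params == NULL || idx >= nparams || params[idx] < 0.0)
        return la_linrx_defaults[idx];
    return params[idx];
}

/* Hypothetical helper: ask for the default of every entry by storing a
   negative value in each position. */
static void request_all_defaults(double *params, int nparams)
{
    for (int i = 0; i < nparams; ++i)
        params[i] = -1.0;
}
```

A caller that only wants to raise the refinement limit could fill the array with request_all_defaults, set the la_linrx_ithresh_i entry to 100.0, and pass nparams = 2 so that the componentwise flag keeps its default.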
Output Parameters
x REAL for ssysvxx
DOUBLE PRECISION for dsysvxx
COMPLEX for csysvxx
DOUBLE COMPLEX for zsysvxx.
Array, DIMENSION (ldx,nrhs). If info = 0, the array x contains the solution n-by-nrhs matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is: inv(diag(s))*X.
a If fact = 'E' and equed = 'Y', overwritten by diag(s)*A*diag(s).
af If fact = 'N', af is an output argument and on exit returns the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T.
b If equed = 'N', B is not modified. If equed = 'Y', B is overwritten by diag(s)*B.
s This array is an output argument if fact ≠ 'F'. Each element of this array is a power of the radix. See the description of s in Input Arguments section.
rcond REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision. Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
rpvgrw REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the LU factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A.
berr REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j abs(xtrue(j,i) - x(j,i)) / max_j abs(x(j,i))
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned.
The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
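The indexing of the error-bound array described above can be illustrated with a short C sketch. It assumes the array is stored column by column with nrhs rows, as a Fortran array of DIMENSION (nrhs,n_err_bnds) would be; whether a given C wrapper hands the data back in that order is an assumption to check for your interface. The helper names err_bnd and first_untrusted_rhs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Element (i, err) of an nrhs-by-n_err_bnds array stored column by
   column. Both i and err are 1-based, matching the text above. */
static double err_bnd(const double *err_bnds, int nrhs, int i, int err)
{
    assert(i >= 1 && i <= nrhs && err >= 1);
    return err_bnds[(size_t)(err - 1) * (size_t)nrhs + (size_t)(i - 1)];
}

/* Returns the smallest 1-based i whose "trust" field (err = 1) is 0.0,
   or 0 if every right-hand side is marked as trustworthy. */
static int first_untrusted_rhs(const double *err_bnds, int nrhs,
                               int n_err_bnds)
{
    if (n_err_bnds < 1)
        return 0;  /* the trust field was not requested */
    for (int i = 1; i <= nrhs; ++i)
        if (err_bnd(err_bnds, nrhs, i, 1) == 0.0)
            return i;
    return 0;
}
```

With the same layout, err_bnd(err_bnds, nrhs, i, 2) reads the error bound and err_bnd(err_bnds, nrhs, i, 3) reads the reciprocal condition number for the i-th right-hand side.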
err_bnds_comp REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( abs(xtrue(j,i) - x(j,i)) / abs(x(j,i)) )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned.
The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side.
The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch('E') for single precision flavors and sqrt(n)*dlamch('E') for double precision flavors to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z.
Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
ipiv If fact = 'N', ipiv is an output argument and on exit contains details of the interchanges and the block structure of D, as determined by ssytrf for single precision flavors and dsytrf for double precision flavors.
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in Input Arguments section).
params If an entry is less than 0.0, that entry is filled with the default value used for that parameter, otherwise the entry is not modified.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: U(info,info) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1). To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.
?hesv
Computes the solution to the system of linear equations with a Hermitian matrix A and multiple right-hand sides.
Syntax
Fortran 77:
call chesv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info )
call zhesv( uplo, n, nrhs, a, lda, ipiv, b, ldb, work, lwork, info )
Fortran 95:
call hesv( a, b [,uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_chesv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_int* ipiv, lapack_complex_float* b, lapack_int ldb );
lapack_int LAPACKE_zhesv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_int* ipiv, lapack_complex_double* b, lapack_int ldb );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine solves for X the complex system of linear equations A*X = B, where A is an n-by-n Hermitian matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The diagonal pivoting method is used to factor A as A = U*D*U**H or A = L*D*L**H, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
The factored form of A is then used to solve the system of equations A*X = B.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the matrix A, and A is factored as U*D*U**H.
If uplo = 'L', the array a stores the lower triangular part of the matrix A, and A is factored as L*D*L**H.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
a, b, work COMPLEX for chesv
DOUBLE COMPLEX for zhesv.
Arrays: a(lda,*), b(ldb,*), work(*).
The array a contains the upper or the lower triangular part of the Hermitian matrix A (see uplo).
The second dimension of a must be at least max(1, n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work is a workspace array, dimension at least max(1,lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
lwork INTEGER. The size of the work array (lwork ≥ 1).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details and for the suggested value of lwork.
Output Parameters
a If info = 0, a is overwritten by the block-diagonal matrix D and the multipliers used to obtain the factor U (or L) from the factorization of A as computed by ?hetrf.
b If info = 0, b is overwritten by the solution matrix X.
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D, as determined by ?hetrf.
If ipiv(i) = k > 0, then D(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and (i+1)-th row and column of A was interchanged with the m-th row and column.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, D(i,i) is 0.
The factorization has been completed, but D is exactly singular, so the solution could not be computed.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hesv interface are as follows:
a Holds the matrix A of size (n,n).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector of length n.
uplo Must be 'U' or 'L'. The default value is 'U'.
Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any of admissible lwork sizes, which is no less than the minimal value described, the routine completes the task, though probably not so fast as with a recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?hesvx
Uses the diagonal pivoting factorization to compute the solution to the complex system of linear equations with a Hermitian matrix A, and provides error bounds on the solution.
Syntax
Fortran 77:
call chesvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, rwork, info )
call zhesvx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, lwork, rwork, info )
Fortran 95:
call hesvx( a, b, x [,uplo] [,af] [,ipiv] [,fact] [,ferr] [,berr] [,rcond] [,info] )
C:
lapack_int LAPACKE_chesvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zhesvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine uses the diagonal pivoting factorization to compute the solution to a complex system of linear equations A*X = B, where A is an n-by-n Hermitian matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?hesvx performs the following steps:
1. If fact = 'N', the diagonal pivoting method is used to factor the matrix A. The form of the factorization is A = U*D*U**H or A = L*D*L**H, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
2. If some D(i,i) = 0, so that D is exactly singular, then the routine returns with info = i.
Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.
Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F' or 'N'. Specifies whether or not the factored form of the matrix A has been supplied on entry.
If fact = 'F': on entry, af and ipiv contain the factored form of A. Arrays a, af, and ipiv are not modified.
If fact = 'N', the matrix A is copied to af and factored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array a stores the upper triangular part of the Hermitian matrix A, and A is factored as U*D*U**H.
If uplo = 'L', the array a stores the lower triangular part of the Hermitian matrix A, and A is factored as L*D*L**H.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
a, af, b, work COMPLEX for chesvx
DOUBLE COMPLEX for zhesvx.
Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*).
The array a contains the upper or the lower triangular part of the Hermitian matrix A (see uplo). The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'.
It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**H or A = L*D*L**H as computed by ?hetrf. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1, nrhs).
work(*) is a workspace array of dimension at least max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, n).
ldaf INTEGER. The leading dimension of af; ldaf ≥ max(1, n).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D, as determined by ?hetrf.
If ipiv(i) = k > 0, then D(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and (i-1)-th row and column of A was interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and (i+1)-th row and column of A was interchanged with the m-th row and column.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
lwork INTEGER. The size of the work array.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details and for the suggested value of lwork.
rwork REAL for chesvx
DOUBLE PRECISION for zhesvx.
Workspace array, DIMENSION at least max(1, n).
Output Parameters
x COMPLEX for chesvx
DOUBLE COMPLEX for zhesvx.
Array, DIMENSION (ldx,*).
If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs).
af, ipiv These arrays are output arguments if fact = 'N'. See the description of af, ipiv in Input Arguments section.
rcond REAL for chesvx
DOUBLE PRECISION for zhesvx.
An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr REAL for chesvx
DOUBLE PRECISION for zhesvx.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.
berr REAL for chesvx
DOUBLE PRECISION for zhesvx.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, then D(i,i) is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then D is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest. Fortran 95 Interface Notes Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine hesvx interface are as follows: a Holds the matrix A of size (n,n). b Holds the matrix B of size (n,nrhs). x Holds the matrix X of size (n,nrhs). af Holds the matrix AF of size (n,n). ipiv Holds the vector of length n. ferr Holds the vector of length (nrhs). berr Holds the vector of length (nrhs). uplo Must be 'U' or 'L'. The default value is 'U'. fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments af and ipiv must be present; otherwise, an error is returned. Application Notes The value of lwork must be at least 2*n. For better performance, try using lwork = n*blocksize, where blocksize is the optimal block size for ?hetrf. If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1. If you choose the first option and set any of admissible lwork sizes, which is no less than the minimal value described, the routine completes the task, though probably not so fast as with a recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs. If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query. 
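The workspace sizing advice above can be sketched in C without calling the library. This is an illustration under stated assumptions: the cplx16 struct is a hypothetical stand-in for a double-precision complex element, the clamp of the minimum to 1 for n = 0 is an assumption added so that the work array is never empty, and the function names are invented for the example.

```c
#include <assert.h>

/* Hypothetical stand-in for one double-precision complex element of
   work; the size reported in work(1) is read from the real part. */
typedef struct { double real; double imag; } cplx16;

/* Smallest admissible lwork for ?hesvx according to the note above
   (at least 2*n). The clamp to 1 is an assumption of this sketch. */
static int hesvx_lwork_min(int n)
{
    assert(n >= 0);
    return (2 * n > 1) ? 2 * n : 1;
}

/* Suggested lwork = n*blocksize, never below the minimum. */
static int hesvx_lwork_suggested(int n, int blocksize)
{
    int lo = hesvx_lwork_min(n);
    int want = n * blocksize;
    return (want > lo) ? want : lo;
}

/* After a successful call, or after a workspace query (lwork = -1),
   work(1) holds the recommended size; this converts it to an integer. */
static int lwork_from_work1(const cplx16 *work)
{
    return (int)work[0].real;
}
```

A typical sequence would be a first call with lwork = -1, a conversion of work(1) with lwork_from_work1, allocation of that many elements, and then the actual solve; hesvx_lwork_suggested is only a fallback when a block size is already known.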
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?hesvxx
Uses extra precise iterative refinement to compute the solution to the system of linear equations with a Hermitian indefinite matrix A applying the diagonal pivoting factorization.
Syntax
Fortran 77:
call chesvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
call zhesvxx( fact, uplo, n, nrhs, a, lda, af, ldaf, ipiv, equed, s, b, ldb, x, ldx, rcond, rpvgrw, berr, n_err_bnds, err_bnds_norm, err_bnds_comp, nparams, params, work, rwork, info )
C:
lapack_int LAPACKE_chesvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_float* a, lapack_int lda, lapack_complex_float* af, lapack_int ldaf, lapack_int* ipiv, char* equed, float* s, lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* rpvgrw, float* berr, lapack_int n_err_bnds, float* err_bnds_norm, float* err_bnds_comp, lapack_int nparams, const float* params );
lapack_int LAPACKE_zhesvxx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, lapack_complex_double* a, lapack_int lda, lapack_complex_double* af, lapack_int ldaf, lapack_int* ipiv, char* equed, double* s, lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* rpvgrw, double* berr, lapack_int n_err_bnds, double* err_bnds_norm, double* err_bnds_comp, lapack_int nparams, const double* params );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine uses the diagonal pivoting factorization to compute the solution to a complex/double complex system of linear equations
A*X = B, where A is an n-by-n Hermitian matrix, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Both normwise and maximum componentwise error bounds are also provided on request. The routine returns a solution with a small guaranteed error (O(eps), where eps is the working machine precision) unless the matrix is very ill-conditioned, in which case a warning is returned. Relevant condition numbers are also calculated and returned.
The routine accepts user-provided factorizations and equilibration factors; see definitions of the fact and equed options. Solving with refinement and using a factorization from a previous call of the routine also produces a solution with O(eps) errors or warnings, but that may not be true for general user-provided factorizations and equilibration factors if they differ from what the routine would itself produce.
The routine ?hesvxx performs the following steps:
1. If fact = 'E', scaling factors are computed to equilibrate the system: diag(s)*A*diag(s)*inv(diag(s))*X = diag(s)*B. Whether or not the system will be equilibrated depends on the scaling of the matrix A, but if equilibration is used, A is overwritten by diag(s)*A*diag(s) and B by diag(s)*B.
2. If fact = 'N' or 'E', the diagonal pivoting factorization is used to factor the matrix A (after equilibration if fact = 'E') as A = U*D*U**H, if uplo = 'U', or A = L*D*L**H, if uplo = 'L', where U or L is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
3. If some D(i,i)=0, so that D is exactly singular, the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A (see the rcond parameter). If the reciprocal of the condition number is less than machine precision, the routine still goes on to solve for X and compute error bounds.
4. The system of equations is solved for X using the factored form of A.
5. By default, unless params(la_linrx_itref_i) is set to zero, the routine applies iterative refinement to get a small error and error bounds. Refinement calculates the residual to at least twice the working precision.
6. If equilibration was used, the matrix X is premultiplied by diag(s) so that it solves the original system before equilibration.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
fact CHARACTER*1. Must be 'F', 'N', or 'E'. Specifies whether or not the factored form of the matrix A is supplied on entry, and if not, whether the matrix A should be equilibrated before it is factored. If fact = 'F', on entry, af and ipiv contain the factored form of A. If equed is not 'N', the matrix A has been equilibrated with scaling factors given by s. Parameters a, af, and ipiv are not modified. If fact = 'N', the matrix A will be copied to af and factored. If fact = 'E', the matrix A will be equilibrated, if necessary, copied to af and factored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The number of linear equations; the order of the matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns of the matrices B and X; nrhs ≥ 0.
a, af, b, work COMPLEX for chesvxx; DOUBLE COMPLEX for zhesvxx. Arrays: a(lda,*), af(ldaf,*), b(ldb,*), work(*). The array a contains the Hermitian matrix A as specified by uplo.
If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A and the strictly lower triangular part of a is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A and the strictly upper triangular part of a is not referenced. The second dimension of a must be at least max(1,n).
The array af is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**H or A = L*D*L**H as computed by ?hetrf. The second dimension of af must be at least max(1,n).
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
work(*) is a workspace array. The dimension of work must be at least max(1,2*n).
lda INTEGER. The leading dimension of the array a; lda ≥ max(1,n).
ldaf INTEGER. The leading dimension of the array af; ldaf ≥ max(1,n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D as determined by ?hetrf. If ipiv(k) > 0, rows and columns k and ipiv(k) were interchanged and D(k,k) is a 1-by-1 diagonal block. If uplo = 'U' and ipiv(k) = ipiv(k-1) < 0, rows and columns k-1 and -ipiv(k) were interchanged and D(k-1:k, k-1:k) is a 2-by-2 diagonal block. If uplo = 'L' and ipiv(k) = ipiv(k+1) < 0, rows and columns k+1 and -ipiv(k) were interchanged and D(k:k+1, k:k+1) is a 2-by-2 diagonal block.
equed CHARACTER*1. Must be 'N' or 'Y'. equed is an input argument if fact = 'F'. It specifies the form of equilibration that was done: If equed = 'N', no equilibration was done (always true if fact = 'N'). If equed = 'Y', both row and column equilibration was done, that is, A has been replaced by diag(s)*A*diag(s).
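As a rough illustration of what equed = 'Y' means, the following C sketch applies the scaling diag(s)*A*diag(s) to a full column-major matrix and diag(s)*B to the right-hand sides. The names equilibrate_full and scale_rhs are hypothetical helpers and are not part of Intel MKL. The routine itself works on the triangle selected by uplo; this sketch scales every stored element for simplicity.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper: overwrite the n-by-n column-major matrix a with
   diag(s)*a*diag(s). Element (i,j) is multiplied by s[i]*s[j]; s is real,
   so a Hermitian matrix stays Hermitian. */
void equilibrate_full(int n, double complex *a, int lda, const double *s)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            a[(size_t)j * (size_t)lda + (size_t)i] *= s[i] * s[j];
}

/* Hypothetical helper: overwrite the n-by-nrhs column-major matrix b
   with diag(s)*b, that is, scale row i of b by s[i]. */
void scale_rhs(int n, int nrhs, double complex *b, int ldb, const double *s)
{
    assert(n >= 0 && nrhs >= 0 && ldb >= (n > 1 ? n : 1));
    for (int j = 0; j < nrhs; ++j)
        for (int i = 0; i < n; ++i)
            b[(size_t)j * (size_t)ldb + (size_t)i] *= s[i];
}
```

When every s[i] is a power of the radix, these multiplications are exact unless a result underflows or overflows.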
s REAL for chesvxx; DOUBLE PRECISION for zhesvxx. Array, DIMENSION (n). The array s contains the scale factors for A. If equed = 'Y', A is multiplied on the left and right by diag(s). This array is an input argument if fact = 'F' only; otherwise it is an output argument. If fact = 'F' and equed = 'Y', each element of s must be positive. Each element of s should be a power of the radix to ensure a reliable solution and error estimates. Scaling by powers of the radix does not cause rounding errors unless the result underflows or overflows. Rounding errors during scaling lead to refining with a matrix that is not equivalent to the input matrix, producing error estimates that may not be reliable.
ldb INTEGER. The leading dimension of the array b; ldb ≥ max(1, n).
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
n_err_bnds INTEGER. Number of error bounds to return for each right-hand side and each type (normwise or componentwise). See err_bnds_norm and err_bnds_comp descriptions in the Output Parameters section below.
nparams INTEGER. Specifies the number of parameters set in params. If nparams = 0, the params array is never referenced and default values are used.
params REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Array, DIMENSION nparams. Specifies algorithm parameters. If an entry is less than 0.0, that entry is filled with the default value used for that parameter. Only positions up to nparams are accessed; defaults are used for higher-numbered parameters. If defaults are acceptable, you can pass nparams = 0, which prevents the source code from accessing the params argument.
params(la_linrx_itref_i = 1) : Whether to perform iterative refinement or not. Default: 1.0 (for single precision flavors), 1.0D+0 (for double precision flavors).
=0.0 No refinement is performed and no error bounds are computed.
=1.0 Use the extra-precise refinement algorithm.
(Other values are reserved for future use.)
params(la_linrx_ithresh_i = 2) : Maximum number of residual computations allowed for refinement. Default: 10. Aggressive: set to 100 to permit convergence using approximate factorizations or factorizations other than LU. If the factorization uses a technique other than Gaussian elimination, the guarantees in err_bnds_norm and err_bnds_comp may no longer be trustworthy.
params(la_linrx_cwise_i = 3) : Flag determining if the code will attempt to find a solution with a small componentwise relative error in the double-precision algorithm. Positive is true, 0.0 is false. Default: 1.0 (attempt componentwise convergence).
rwork REAL for chesvxx; DOUBLE PRECISION for zhesvxx. Workspace array, DIMENSION at least max(1, 3*n); used in complex flavors only.

Output Parameters
x COMPLEX for chesvxx; DOUBLE COMPLEX for zhesvxx. Array, DIMENSION (ldx,nrhs). If info = 0, the array x contains the solution n-by-nrhs matrix X to the original system of equations. Note that A and B are modified on exit if equed ≠ 'N', and the solution to the equilibrated system is: inv(diag(s))*X.
a If fact = 'E' and equed = 'Y', overwritten by diag(s)*A*diag(s).
af If fact = 'N', af is an output argument and on exit returns the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**H or A = L*D*L**H.
b If equed = 'N', B is not modified. If equed = 'Y', B is overwritten by diag(s)*B.
s This array is an output argument if fact ≠ 'F'. Each element of this array is a power of the radix. See the description of s in the Input Parameters section.
rcond REAL for chesvxx; DOUBLE PRECISION for zhesvxx. Reciprocal scaled condition number. An estimate of the reciprocal Skeel condition number of the matrix A after equilibration (if done). If rcond is less than the machine precision, in particular, if rcond = 0, the matrix is singular to working precision.
Note that the error may still be small even if this number is very small and the matrix appears ill-conditioned.
rpvgrw REAL for chesvxx; DOUBLE PRECISION for zhesvxx. Contains the reciprocal pivot growth factor norm(A)/norm(U). The max absolute element norm is used. If this is much less than 1, the stability of the factorization of the (equilibrated) matrix A could be poor. This also means that the solution X, estimated condition numbers, and error bounds could be unreliable. If factorization fails with 0 < info ≤ n, this parameter contains the reciprocal pivot growth factor for the leading info columns of A.
berr REAL for chesvxx; DOUBLE PRECISION for zhesvxx. Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.
err_bnds_norm REAL for chesvxx; DOUBLE PRECISION for zhesvxx. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the normwise relative error, which is defined as follows:
Normwise relative error in the i-th solution vector: max_j |xtrue(j,i) - x(j,i)| / max_j |x(j,i)|
The array is indexed by the type of error information as described below. There are currently up to three pieces of information returned. The first index in err_bnds_norm(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_norm(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(e) for chesvxx and sqrt(n)*dlamch(e) for zhesvxx.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for chesvxx and sqrt(n)*dlamch(e) for zhesvxx. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated normwise reciprocal condition number. Compared with the threshold sqrt(n)*slamch(e) for chesvxx and sqrt(n)*dlamch(e) for zhesvxx to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*a, where s scales each row by a power of the radix so all absolute row sums of z are approximately 1.
err_bnds_comp REAL for chesvxx; DOUBLE PRECISION for zhesvxx. Array, DIMENSION (nrhs,n_err_bnds). For each right-hand side, contains information about various error bounds and condition numbers corresponding to the componentwise relative error, which is defined as follows:
Componentwise relative error in the i-th solution vector: max_j ( |xtrue(j,i) - x(j,i)| / |x(j,i)| )
The array is indexed by the right-hand side i, on which the componentwise relative error depends, and by the type of error information as described below. There are currently up to three pieces of information returned for each right-hand side. If componentwise accuracy is not requested (params(3) = 0.0), then err_bnds_comp is not accessed. If n_err_bnds < 3, then at most the first (:,n_err_bnds) entries are returned. The first index in err_bnds_comp(i,:) corresponds to the i-th right-hand side. The second index in err_bnds_comp(:,err) contains the following three fields:
err=1 "Trust/don't trust" boolean. Trust the answer if the reciprocal condition number is greater than the threshold sqrt(n)*slamch(e) for chesvxx and sqrt(n)*dlamch(e) for zhesvxx.
err=2 "Guaranteed" error bound. The estimated forward error, almost certainly within a factor of 10 of the true error so long as the next entry is greater than the threshold sqrt(n)*slamch(e) for chesvxx and sqrt(n)*dlamch(e) for zhesvxx. This error bound should only be trusted if the previous boolean is true.
err=3 Reciprocal condition number. Estimated componentwise reciprocal condition number.
Compared with the threshold sqrt(n)*slamch(e) for chesvxx and sqrt(n)*dlamch(e) for zhesvxx to determine if the error estimate is "guaranteed". These reciprocal condition numbers are 1/(norm(inv(z),inf)*norm(z,inf)) for some appropriately scaled matrix Z. Let z=s*(a*diag(x)), where x is the solution for the current right-hand side and s scales each row of a*diag(x) by a power of the radix so all absolute row sums of z are approximately 1.
ipiv If fact = 'N', ipiv is an output argument and on exit contains details of the interchanges and the block structure of D, as determined by chetrf for chesvxx and zhetrf for zhesvxx.
equed If fact ≠ 'F', then equed is an output argument. It specifies the form of equilibration that was done (see the description of equed in the Input Parameters section).
params If an entry is less than 0.0, that entry is filled with the default value used for that parameter, otherwise the entry is not modified.
info INTEGER. If info = 0, the execution is successful. The solution to every right-hand side is guaranteed.
If info = -i, the i-th parameter had an illegal value.
If 0 < info ≤ n: D(info,info) is exactly zero. The factorization has been completed, but D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = n+j: The solution corresponding to the j-th right-hand side is not guaranteed. The solutions corresponding to other right-hand sides k with k > j may not be guaranteed as well, but only the first such right-hand side is reported. If a small componentwise error is not requested (params(3) = 0.0), then the j-th right-hand side is the first with a normwise error bound that is not guaranteed (the smallest j such that err_bnds_norm(j,1) = 0.0). By default (params(3) = 1.0), the j-th right-hand side is the first with either a normwise or componentwise error bound that is not guaranteed (the smallest j such that either err_bnds_norm(j,1) = 0.0 or err_bnds_comp(j,1) = 0.0). See the definition of err_bnds_norm(:,1) and err_bnds_comp(:,1).
To get information about all of the right-hand sides, check err_bnds_norm or err_bnds_comp.

?spsv
Computes the solution to the system of linear equations with a real or complex symmetric matrix A stored in packed format, and multiple right-hand sides.

Syntax
Fortran 77:
call sspsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call dspsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call cspsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call zspsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
Fortran 95:
call spsv( ap, b [,uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>spsv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* ap, lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The diagonal pivoting method is used to factor A as A = U*D*U**T or A = L*D*L**T, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER.
The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
ap, b REAL for sspsv; DOUBLE PRECISION for dspsv; COMPLEX for cspsv; DOUBLE COMPLEX for zspsv. Arrays: ap(*), b(ldb,*). The dimension of ap must be at least max(1,n(n+1)/2). The array ap contains the upper or lower triangle of the symmetric matrix A, as specified by uplo, in packed storage (see Matrix Storage Schemes). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. The second dimension of b must be at least max(1,nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
ap The block-diagonal matrix D and the multipliers used to obtain the factor U (or L) from the factorization of A as computed by ?sptrf, stored as a packed triangular matrix in the same storage format as A.
b If info = 0, b is overwritten by the solution matrix X.
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D, as determined by ?sptrf. If ipiv(i) = k > 0, then d(i,i) is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column. If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column. If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, d(i,i) is 0. The factorization has been completed, but D is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions. Specific details for the routine spsv interface are as follows: ap Holds the array A of size (n*(n+1)/2). b Holds the matrix B of size (n,nrhs). ipiv Holds the vector with the number of elements n. uplo Must be 'U' or 'L'. The default value is 'U'. ?spsvx Uses the diagonal pivoting factorization to compute the solution to the system of linear equations with a real or complex symmetric matrix A stored in packed format, and provides error bounds on the solution. Syntax Fortran 77: call sspsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info ) call dspsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, iwork, info ) call cspsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info ) call zspsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info ) Fortran 95: call spsvx( ap, b, x [,uplo] [,afp] [,ipiv] [,fact] [,ferr] [,berr] [,rcond] [,info] ) LAPACK Routines: Linear Equations 3 657 C: lapack_int LAPACKE_sspsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const float* ap, float* afp, lapack_int* ipiv, const float* b, lapack_int ldb, float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_dspsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const double* ap, double* afp, lapack_int* ipiv, const double* b, lapack_int ldb, double* x, lapack_int ldx, double* rcond, double* ferr, double* berr ); lapack_int LAPACKE_cspsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, lapack_complex_float* afp, lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr ); lapack_int LAPACKE_zspsvx( int 
matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, lapack_complex_double* afp, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine uses the diagonal pivoting factorization to compute the solution to a real or complex system of linear equations A*X = B, where A is an n-by-n symmetric matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
Error bounds on the solution and a condition estimate are also provided.
The routine ?spsvx performs the following steps:
1. If fact = 'N', the diagonal pivoting method is used to factor the matrix A. The form of the factorization is A = U*D*U**T or A = L*D*L**T, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
2. If some d(i,i) = 0, so that D is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.
3. The system of equations is solved for X using the factored form of A.
4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
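The info conventions that follow from the ?spsvx steps in the Description can be captured in a small helper. This is only a sketch: the enum spsvx_status and the functions classify_spsvx_info and spsvx_solution_available are hypothetical names, not part of Intel MKL, and n is assumed to be the matrix order that was passed to the routine.

```c
#include <assert.h>

/* Hypothetical status codes for interpreting the info value of ?spsvx. */
typedef enum {
    SPSVX_OK,               /* info = 0: successful execution              */
    SPSVX_BAD_ARGUMENT,     /* info = -i: the i-th parameter was illegal   */
    SPSVX_SINGULAR,         /* 0 < info <= n: D is exactly singular        */
    SPSVX_ILL_CONDITIONED,  /* info = n+1: rcond below machine precision   */
    SPSVX_UNKNOWN           /* any other value, not described in the text  */
} spsvx_status;

/* Map an info value to one of the cases listed in the description. */
spsvx_status classify_spsvx_info(int info, int n)
{
    assert(n >= 0);
    if (info == 0)     return SPSVX_OK;
    if (info < 0)      return SPSVX_BAD_ARGUMENT;
    if (info <= n)     return SPSVX_SINGULAR;
    if (info == n + 1) return SPSVX_ILL_CONDITIONED;
    return SPSVX_UNKNOWN;
}

/* The array x holds a computed solution only for info = 0 or info = n+1. */
int spsvx_solution_available(int info, int n)
{
    return info == 0 || info == n + 1;
}
```

Treating info = n+1 as a warning and not as a failure matches the description: the solution and error bounds are still computed in that case.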
fact CHARACTER*1. Must be 'F' or 'N'. Specifies whether or not the factored form of the matrix A has been supplied on entry. If fact = 'F': on entry, afp and ipiv contain the factored form of A. Arrays ap, afp, and ipiv are not modified. If fact = 'N', the matrix A is copied to afp and factored.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored and how A is factored: If uplo = 'U', the array ap stores the upper triangular part of the symmetric matrix A, and A is factored as U*D*U**T. If uplo = 'L', the array ap stores the lower triangular part of the symmetric matrix A, and A is factored as L*D*L**T.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides, the number of columns in B; nrhs ≥ 0.
ap, afp, b, work REAL for sspsvx; DOUBLE PRECISION for dspsvx; COMPLEX for cspsvx; DOUBLE COMPLEX for zspsvx. Arrays: ap(*), afp(*), b(ldb,*), work(*). The array ap contains the upper or lower triangle of the symmetric matrix A in packed storage (see Matrix Storage Schemes). The array afp is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U**T or A = L*D*L**T as computed by ?sptrf, in the same storage format as A. The array b contains the matrix B whose columns are the right-hand sides for the systems of equations. work(*) is a workspace array. The dimension of arrays ap and afp must be at least max(1, n(n+1)/2); the second dimension of b must be at least max(1,nrhs); the dimension of work must be at least max(1,3*n) for real flavors and max(1,2*n) for complex flavors.
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).
ipiv INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D, as determined by ?sptrf.
If ipiv(i) = k > 0, then d(i,i) is a 1-by-1 diagonal block, and the i-th row and column of A was interchanged with the k-th row and column. If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column. If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.
ldx INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).
iwork INTEGER. Workspace array, DIMENSION at least max(1, n); used in real flavors only.
rwork REAL for cspsvx; DOUBLE PRECISION for zspsvx. Workspace array, DIMENSION at least max(1, n); used in complex flavors only.

Output Parameters
x REAL for sspsvx; DOUBLE PRECISION for dspsvx; COMPLEX for cspsvx; DOUBLE COMPLEX for zspsvx. Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs).
afp, ipiv These arrays are output arguments if fact = 'N'. See the description of afp, ipiv in the Input Parameters section.
rcond REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.
ferr, berr REAL for single precision flavors; DOUBLE PRECISION for double precision flavors. Arrays, DIMENSION at least max(1, nrhs). Contain the componentwise forward and relative backward errors, respectively, for each solution vector.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, and i ≤ n, then d(i,i) is exactly zero.
The factorization has been completed, but the block diagonal matrix D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned. If info = i, and i = n + 1, then D is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine spsvx interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n,nrhs).
x Holds the matrix X of size (n,nrhs).
afp Holds the array AF of size (n*(n+1)/2).
ipiv Holds the vector with the number of elements n.
ferr Holds the vector with the number of elements nrhs.
berr Holds the vector with the number of elements nrhs.
uplo Must be 'U' or 'L'. The default value is 'U'.
fact Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments afp and ipiv must be present; otherwise, an error is returned.

?hpsv
Computes the solution to the system of linear equations with a Hermitian matrix A stored in packed format, and multiple right-hand sides.
Syntax
Fortran 77:
call chpsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
call zhpsv( uplo, n, nrhs, ap, ipiv, b, ldb, info )
Fortran 95:
call hpsv( ap, b [,uplo] [,ipiv] [,info] )
C:
lapack_int LAPACKE_<?>hpsv( int matrix_order, char uplo, lapack_int n, lapack_int nrhs, <datatype>* ap, lapack_int* ipiv, <datatype>* b, lapack_int ldb );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine solves for X the system of linear equations A*X = B, where A is an n-by-n Hermitian matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.
The diagonal pivoting method is used to factor A as A = U*D*U**H or A = L*D*L**H, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.
The factored form of A is then used to solve the system of equations A*X = B.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored: If uplo = 'U', the upper triangle of A is stored. If uplo = 'L', the lower triangle of A is stored.
n INTEGER. The order of matrix A; n ≥ 0.
nrhs INTEGER. The number of right-hand sides; the number of columns in B; nrhs ≥ 0.
ap, b COMPLEX for chpsv; DOUBLE COMPLEX for zhpsv. Arrays: ap(*), b(ldb,*). The dimension of ap must be at least max(1,n(n+1)/2). The array ap contains the upper or lower triangle of the Hermitian matrix A, as specified by uplo, in packed storage (see Matrix Storage Schemes). The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
The second dimension of b must be at least max(1,nrhs).
ldb INTEGER. The leading dimension of b; ldb ≥ max(1, n).

Output Parameters
ap The block-diagonal matrix D and the multipliers used to obtain the factor U (or L) from the factorization of A as computed by ?hptrf, stored as a packed triangular matrix in the same storage format as A.
b If info = 0, b is overwritten by the solution matrix X.
ipiv INTEGER. Array, DIMENSION at least max(1, n). Contains details of the interchanges and the block structure of D, as determined by ?hptrf. If ipiv(i) = k > 0, then d(i,i) is a 1-by-1 block, and the i-th row and column of A was interchanged with the k-th row and column. If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A was interchanged with the m-th row and column. If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A was interchanged with the m-th row and column.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, d(i,i) is 0. The factorization has been completed, but D is exactly singular, so the solution could not be computed.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hpsv interface are as follows:
ap Holds the array A of size (n*(n+1)/2).
b Holds the matrix B of size (n,nrhs).
ipiv Holds the vector with the number of elements n.
uplo Must be 'U' or 'L'. The default value is 'U'.
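To make the ipiv encoding described for ?hpsv concrete, the following C sketch walks an ipiv array and counts the 1-by-1 and 2-by-2 diagonal blocks of D. It assumes that ipiv holds the 1-based values described above in a 0-based C array of plain int (the C interface uses lapack_int). The type block_counts and the function count_pivot_blocks are hypothetical names, not part of Intel MKL.

```c
#include <assert.h>

/* Hypothetical result type: number of diagonal blocks of each size in D. */
typedef struct {
    int one_by_one;
    int two_by_two;
} block_counts;

/* Walk ipiv[0..n-1] and count the diagonal blocks of D.
   A positive entry marks a 1-by-1 block. Two adjacent equal negative
   entries mark a 2-by-2 block; this holds for both uplo = 'U' and
   uplo = 'L' under the convention described in the text.
   Returns 0 on success, or -1 if the array does not follow the convention. */
int count_pivot_blocks(int n, const int *ipiv, block_counts *out)
{
    assert(n >= 0 && out != 0);
    out->one_by_one = 0;
    out->two_by_two = 0;
    int i = 0;
    while (i < n) {
        if (ipiv[i] > 0) {
            out->one_by_one++;
            i += 1;
        } else if (ipiv[i] < 0 && i + 1 < n && ipiv[i + 1] == ipiv[i]) {
            out->two_by_two++;
            i += 2;
        } else {
            return -1;  /* zero entry or unpaired negative entry */
        }
    }
    return 0;
}
```

For example, an ipiv of {1, -2, -2, 4} describes a 1-by-1 block, then a 2-by-2 block in rows/columns 2 and 3, then another 1-by-1 block.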
?hpsvx

Uses the diagonal pivoting factorization to compute the solution to the system of linear equations with a Hermitian matrix A stored in packed format, and provides error bounds on the solution.

Syntax

Fortran 77:
call chpsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )
call zhpsvx( fact, uplo, n, nrhs, ap, afp, ipiv, b, ldb, x, ldx, rcond, ferr, berr, work, rwork, info )

Fortran 95:
call hpsvx( ap, b, x [,uplo] [,afp] [,ipiv] [,fact] [,ferr] [,berr] [,rcond] [,info] )

C:
lapack_int LAPACKE_chpsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_float* ap, lapack_complex_float* afp, lapack_int* ipiv, const lapack_complex_float* b, lapack_int ldb, lapack_complex_float* x, lapack_int ldx, float* rcond, float* ferr, float* berr );
lapack_int LAPACKE_zhpsvx( int matrix_order, char fact, char uplo, lapack_int n, lapack_int nrhs, const lapack_complex_double* ap, lapack_complex_double* afp, lapack_int* ipiv, const lapack_complex_double* b, lapack_int ldb, lapack_complex_double* x, lapack_int ldx, double* rcond, double* ferr, double* berr );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine uses the diagonal pivoting factorization to compute the solution to a complex system of linear equations A*X = B, where A is an n-by-n Hermitian matrix stored in packed format, the columns of matrix B are individual right-hand sides, and the columns of X are the corresponding solutions.

Error bounds on the solution and a condition estimate are also provided.

The routine ?hpsvx performs the following steps:

1. If fact = 'N', the diagonal pivoting method is used to factor the matrix A.
The form of the factorization is A = U*D*U^H or A = L*D*L^H, where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is Hermitian and block diagonal with 1-by-1 and 2-by-2 diagonal blocks.

2. If some d_ii = 0, so that D is exactly singular, then the routine returns with info = i. Otherwise, the factored form of A is used to estimate the condition number of the matrix A. If the reciprocal of the condition number is less than machine precision, info = n+1 is returned as a warning, but the routine still goes on to solve for X and compute error bounds as described below.

3. The system of equations is solved for X using the factored form of A.

4. Iterative refinement is applied to improve the computed solution matrix and calculate error bounds and backward error estimates for it.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

fact  CHARACTER*1. Must be 'F' or 'N'. Specifies whether or not the factored form of the matrix A has been supplied on entry.
If fact = 'F': on entry, afp and ipiv contain the factored form of A. Arrays ap, afp, and ipiv are not modified.
If fact = 'N', the matrix A is copied to afp and factored.

uplo  CHARACTER*1. Must be 'U' or 'L'. Indicates whether the upper or lower triangular part of A is stored and how A is factored:
If uplo = 'U', the array ap stores the upper triangular part of the Hermitian matrix A, and A is factored as U*D*U^H.
If uplo = 'L', the array ap stores the lower triangular part of the Hermitian matrix A, and A is factored as L*D*L^H.

n  INTEGER. The order of matrix A; n ≥ 0.

nrhs  INTEGER. The number of right-hand sides; the number of columns in B; nrhs ≥ 0.

ap, afp, b, work  COMPLEX for chpsvx
DOUBLE COMPLEX for zhpsvx.
Arrays: ap(*), afp(*), b(ldb,*), work(*).
The array ap contains the upper or lower triangle of the Hermitian matrix A in packed storage (see Matrix Storage Schemes).
The array afp is an input argument if fact = 'F'. It contains the block diagonal matrix D and the multipliers used to obtain the factor U or L from the factorization A = U*D*U^H or A = L*D*L^H as computed by ?hptrf, in the same storage format as A.
The array b contains the matrix B whose columns are the right-hand sides for the systems of equations.
work(*) is a workspace array.
The dimension of arrays ap and afp must be at least max(1,n(n+1)/2); the second dimension of b must be at least max(1,nrhs); the dimension of work must be at least max(1,2*n).

ldb  INTEGER. The leading dimension of b; ldb ≥ max(1, n).

ipiv  INTEGER. Array, DIMENSION at least max(1, n). The array ipiv is an input argument if fact = 'F'. It contains details of the interchanges and the block structure of D, as determined by ?hptrf.
If ipiv(i) = k > 0, then d_ii is a 1-by-1 diagonal block, and the i-th row and column of A were interchanged with the k-th row and column.
If uplo = 'U' and ipiv(i) = ipiv(i-1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i-1, and the (i-1)-th row and column of A were interchanged with the m-th row and column.
If uplo = 'L' and ipiv(i) = ipiv(i+1) = -m < 0, then D has a 2-by-2 block in rows/columns i and i+1, and the (i+1)-th row and column of A were interchanged with the m-th row and column.

ldx  INTEGER. The leading dimension of the output array x; ldx ≥ max(1, n).

rwork  REAL for chpsvx
DOUBLE PRECISION for zhpsvx.
Workspace array, DIMENSION at least max(1, n).

Output Parameters

x  COMPLEX for chpsvx
DOUBLE COMPLEX for zhpsvx.
Array, DIMENSION (ldx,*). If info = 0 or info = n+1, the array x contains the solution matrix X to the system of equations. The second dimension of x must be at least max(1,nrhs).
afp, ipiv  These arrays are output arguments if fact = 'N'. See the description of afp, ipiv in the Input Parameters section.

rcond  REAL for chpsvx
DOUBLE PRECISION for zhpsvx.
An estimate of the reciprocal condition number of the matrix A. If rcond is less than the machine precision (in particular, if rcond = 0), the matrix is singular to working precision. This condition is indicated by a return code of info > 0.

ferr  REAL for chpsvx
DOUBLE PRECISION for zhpsvx.
Array, DIMENSION at least max(1, nrhs). Contains the estimated forward error bound for each solution vector x(j) (the j-th column of the solution matrix X). If xtrue is the true solution corresponding to x(j), ferr(j) is an estimated upper bound for the magnitude of the largest element in (x(j) - xtrue) divided by the magnitude of the largest element in x(j). The estimate is as reliable as the estimate for rcond, and is almost always a slight overestimate of the true error.

berr  REAL for chpsvx
DOUBLE PRECISION for zhpsvx.
Array, DIMENSION at least max(1, nrhs). Contains the componentwise relative backward error for each solution vector x(j), that is, the smallest relative change in any element of A or B that makes x(j) an exact solution.

info  INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, and i ≤ n, then d_ii is exactly zero. The factorization has been completed, but the block diagonal matrix D is exactly singular, so the solution and error bounds could not be computed; rcond = 0 is returned.
If info = i, and i = n + 1, then D is nonsingular, but rcond is less than machine precision, meaning that the matrix is singular to working precision. Nevertheless, the solution and error bounds are computed because there are a number of situations where the computed solution can be more accurate than the value of rcond would suggest.
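The four ranges of info listed above map naturally onto a small classification helper. This is a sketch in plain C under stated assumptions: the enum hpsvx_status and the function classify_hpsvx_info are hypothetical names introduced for illustration and are not part of MKL or LAPACKE. The point it illustrates is that info = n + 1 is a warning (x, ferr, and berr are still computed), while 1 ≤ info ≤ n means no solution was produced.

```c
#include <assert.h>

/* Hypothetical status codes for illustration only. */
enum hpsvx_status {
    HPSVX_OK,               /* info = 0: successful execution            */
    HPSVX_BAD_ARGUMENT,     /* info = -i: the i-th parameter was illegal */
    HPSVX_SINGULAR,         /* 1 <= info <= n: d_ii is exactly zero      */
    HPSVX_ILL_CONDITIONED,  /* info = n + 1: rcond below machine eps     */
    HPSVX_UNKNOWN           /* any other value (not documented)          */
};

/* Interprets the info value returned by ?hpsvx for a matrix of order n. */
enum hpsvx_status classify_hpsvx_info(int info, int n)
{
    assert(n >= 0);
    if (info == 0)     return HPSVX_OK;
    if (info < 0)      return HPSVX_BAD_ARGUMENT;
    if (info <= n)     return HPSVX_SINGULAR;
    if (info == n + 1) return HPSVX_ILL_CONDITIONED;
    return HPSVX_UNKNOWN;
}
```

A caller would typically use x only for HPSVX_OK and, with care, for HPSVX_ILL_CONDITIONED, checking ferr and berr in the latter case.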
Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or reconstructible arguments, see Fortran 95 Interface Conventions.

Specific details for the routine hpsvx interface are as follows:

ap  Holds the array A of size (n*(n+1)/2).
b  Holds the matrix B of size (n,nrhs).
x  Holds the matrix X of size (n,nrhs).
afp  Holds the array AF of size (n*(n+1)/2).
ipiv  Holds the vector with the number of elements n.
ferr  Holds the vector with the number of elements nrhs.
berr  Holds the vector with the number of elements nrhs.
uplo  Must be 'U' or 'L'. The default value is 'U'.
fact  Must be 'N' or 'F'. The default value is 'N'. If fact = 'F', then both arguments afp and ipiv must be present; otherwise, an error is returned.

LAPACK Routines: Least Squares and Eigenvalue Problems 4

This chapter describes the Intel® Math Kernel Library implementation of routines from the LAPACK package that are used for solving linear least squares problems, eigenvalue and singular value problems, as well as performing a number of related computational tasks.

Sections in this chapter include descriptions of LAPACK computational routines and driver routines. For full reference on LAPACK routines and related information see [LUG].

Least Squares Problems. A typical least squares problem is as follows: given a matrix A and a vector b, find the vector x that minimizes the sum of squares Σ_i ((Ax)_i - b_i)^2 or, equivalently, find the vector x that minimizes the 2-norm ||Ax - b||_2.

In the most usual case, A is an m-by-n matrix with m ≥ n and rank(A) = n. This problem is also referred to as finding the least squares solution to an overdetermined system of linear equations (here we have more equations than unknowns).
To solve this problem, you can use the QR factorization of the matrix A (see QR Factorization).

If m < n and rank(A) = m, there exist an infinite number of solutions x which exactly satisfy Ax = b, and thus minimize the norm ||Ax - b||_2. In this case it is often useful to find the unique solution that minimizes ||x||_2. This problem is referred to as finding the minimum-norm solution to an underdetermined system of linear equations (here we have more unknowns than equations). To solve this problem, you can use the LQ factorization of the matrix A (see LQ Factorization).

In the general case you may have a rank-deficient least squares problem, with rank(A) < min(m, n): find the minimum-norm least squares solution that minimizes both ||x||_2 and ||Ax - b||_2. In this case (or when the rank of A is in doubt) you can use the QR factorization with pivoting or singular value decomposition (see Singular Value Decomposition).

Eigenvalue Problems. The eigenvalue problems (from German eigen "own") are stated as follows: given a matrix A, find the eigenvalues λ and the corresponding eigenvectors z that satisfy the equation

Az = λz (right eigenvectors z)

or the equation

z^H A = λz^H (left eigenvectors z).

If A is a real symmetric or complex Hermitian matrix, the above two equations are equivalent, and the problem is called a symmetric eigenvalue problem. Routines for solving this type of problem are described in the section Symmetric Eigenvalue Problems.

Routines for solving eigenvalue problems with nonsymmetric or non-Hermitian matrices are described in the section Nonsymmetric Eigenvalue Problems.

The library also includes routines that handle generalized symmetric-definite eigenvalue problems: find the eigenvalues λ and the corresponding eigenvectors z that satisfy one of the following equations:

Az = λBz, ABz = λz, or BAz = λz,

where A is symmetric or Hermitian, and B is symmetric positive-definite or Hermitian positive-definite.
Routines for reducing these problems to standard symmetric eigenvalue problems are described in the section Generalized Symmetric-Definite Eigenvalue Problems.

To solve a particular problem, you usually call several computational routines. Sometimes you need to combine the routines of this chapter with other LAPACK routines described in Chapter 3 as well as with BLAS routines described in Chapter 2.

For example, to solve a set of least squares problems minimizing ||Ax - b||_2 for all columns b of a given matrix B (where A and B are real matrices), you can call ?geqrf to form the factorization A = QR, then call ?ormqr to compute C = Q^H*B, and finally call the BLAS routine ?trsm to solve for X the system of equations RX = C.

Another way is to call an appropriate driver routine that performs several tasks in one call. For example, to solve the least squares problem the driver routine ?gels can be used.

WARNING LAPACK routines assume that input matrices do not contain IEEE 754 special values such as INF or NaN values. Using these special values may cause LAPACK to return unexpected results or become unstable.

Starting from release 8.0, Intel MKL supports, along with the FORTRAN 77 interface to LAPACK computational and driver routines, the Fortran 95 interface, which uses simplified routine calls with shorter argument lists. The syntax section of the routine description gives the calling sequence for the Fortran 95 interface, where available, immediately after the FORTRAN 77 calls.

Routine Naming Conventions

For each routine in this chapter, when calling it from a FORTRAN 77 program you can use the LAPACK name.

LAPACK names have the structure xyyzzz, which is explained below.
The initial letter x indicates the data type:

s   real, single precision
c   complex, single precision
d   real, double precision
z   complex, double precision

The second and third letters yy indicate the matrix type and storage scheme:

bb  bidiagonal-block matrix
bd  bidiagonal matrix
ge  general matrix
gb  general band matrix
hs  upper Hessenberg matrix
or  (real) orthogonal matrix
op  (real) orthogonal matrix (packed storage)
un  (complex) unitary matrix
up  (complex) unitary matrix (packed storage)
pt  symmetric or Hermitian positive-definite tridiagonal matrix
sy  symmetric matrix
sp  symmetric matrix (packed storage)
sb  (real) symmetric band matrix
st  (real) symmetric tridiagonal matrix
he  Hermitian matrix
hp  Hermitian matrix (packed storage)
hb  (complex) Hermitian band matrix
tr  triangular or quasi-triangular matrix

The last three letters zzz indicate the computation performed, for example:

qrf  form the QR factorization
lqf  form the LQ factorization

Thus, the routine sgeqrf forms the QR factorization of general real matrices in single precision; the corresponding routine for complex matrices is cgeqrf.

Names of the LAPACK computational and driver routines for the Fortran 95 interface in Intel MKL are the same as the FORTRAN 77 names but without the first letter that indicates the data type. For example, the name of the routine that forms the QR factorization of general real matrices in the Fortran 95 interface is geqrf. Handling of different data types is done through defining a specific internal parameter referring to a module block with named constants for single and double precision.

For details on the design of the Fortran 95 interface for LAPACK computational and driver routines in Intel MKL and for the general information on how the optional arguments are reconstructed, see the Fortran 95 Interface Conventions in Chapter 3.
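To make the xyyzzz structure concrete, the sketch below splits a FORTRAN 77 routine name into its three parts. It is illustrative plain C only: struct lapack_name and split_lapack_name are hypothetical names, not library API, and the helper merely slices the string without checking that the yy code is one of the matrix types listed above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the three parts of a name xyyzzz. */
struct lapack_name {
    char type;       /* x: 's', 'd', 'c' or 'z'                 */
    char matrix[3];  /* yy: matrix type and storage scheme code */
    char op[8];      /* zzz: computation performed              */
};

/* Splits a lowercase FORTRAN 77 LAPACK name such as "sgeqrf".
 * Returns 0 on success, -1 if the name does not fit the pattern. */
int split_lapack_name(const char *name, struct lapack_name *out)
{
    assert(name != NULL && out != NULL);

    size_t len = strlen(name);
    if (len < 4 || len > 2 + sizeof out->op) {
        return -1;                        /* too short, or op would overflow */
    }
    if (strchr("sdcz", name[0]) == NULL) {
        return -1;                        /* unknown data type letter */
    }
    out->type = name[0];
    memcpy(out->matrix, name + 1, 2);
    out->matrix[2] = '\0';
    strcpy(out->op, name + 3);            /* fits: len - 3 <= sizeof op - 1 */
    return 0;
}
```

For "sgeqrf" this yields type 's', matrix code "ge", and operation "qrf"; the Fortran 95 name is the same string without its first letter, that is, name + 1.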
Matrix Storage Schemes

LAPACK routines use the following matrix storage schemes:

• Full storage: a matrix A is stored in a two-dimensional array a, with the matrix element a_ij stored in the array element a(i,j).
• Packed storage: this scheme allows you to store symmetric, Hermitian, or triangular matrices more compactly: the upper or lower triangle of the matrix is packed by columns in a one-dimensional array.
• Band storage: an m-by-n band matrix with kl sub-diagonals and ku super-diagonals is stored compactly in a two-dimensional array ab with kl+ku+1 rows and n columns. Columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array.

In Chapters 3 and 4, arrays that hold matrices in the packed storage have names ending in p; arrays with matrices in the band storage have names ending in b.

For more information on matrix storage schemes, see "Matrix Arguments" in Appendix B.

Mathematical Notation

In addition to the mathematical notation used in the description of BLAS and LAPACK Linear Equations routines, descriptions of the routines to solve Least Squares and Eigenvalue problems use the following notation:

λ_i  Eigenvalues of the matrix A (for the definition of eigenvalues, see Eigenvalue Problems).
σ_i  Singular values of the matrix A. They are equal to square roots of the eigenvalues of A^H*A. (For more information, see Singular Value Decomposition).
||x||_2  The 2-norm of the vector x: ||x||_2 = (Σ_i |x_i|^2)^(1/2) = ||x||_E.
||A||_2  The 2-norm (or spectral norm) of the matrix A: ||A||_2 = max_i σ_i, ||A||_2^2 = max_{|x|=1} (Ax·Ax).
||A||_E  The Euclidean norm of the matrix A: ||A||_E^2 = Σ_i Σ_j |a_ij|^2 (for vectors, the Euclidean norm and the 2-norm are equal: ||x||_E = ||x||_2).
θ(x, y)  The acute angle between vectors x and y: cos θ(x, y) = |x·y| / (||x||_2 ||y||_2).

Computational Routines

In the sections that follow, the descriptions of LAPACK computational routines are given.
These routines perform distinct computational tasks that can be used for:

Orthogonal Factorizations
Singular Value Decomposition
Symmetric Eigenvalue Problems
Generalized Symmetric-Definite Eigenvalue Problems
Nonsymmetric Eigenvalue Problems
Generalized Nonsymmetric Eigenvalue Problems
Generalized Singular Value Decomposition

See also the respective driver routines.

Orthogonal Factorizations

This section describes the LAPACK routines for the QR (RQ) and LQ (QL) factorization of matrices. Routines for the RZ factorization as well as for generalized QR and RQ factorizations are also included.

QR Factorization. Assume that A is an m-by-n matrix to be factored.

If m ≥ n, the QR factorization is given by

$$A = Q \begin{pmatrix} R \\ 0 \end{pmatrix} = (Q_1, Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix} = Q_1 R$$

where R is an n-by-n upper triangular matrix with real diagonal elements, and Q is an m-by-m orthogonal (or unitary) matrix. Here Q1 consists of the first n columns of Q.

You can use the QR factorization for solving the following least squares problem: minimize ||Ax - b||_2 where A is a full-rank m-by-n matrix (m ≥ n). After factoring the matrix, compute the solution x by solving R*x = (Q1)^T*b.

If m < n, the QR factorization is given by

A = Q*R = Q*(R1, R2)

where R is trapezoidal, R1 is upper triangular and R2 is rectangular.

The LAPACK routines do not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.

LQ Factorization. LQ factorization of an m-by-n matrix A is as follows. If m ≤ n,

$$A = (L, 0)\, Q = (L, 0) \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} = L Q_1$$

where L is an m-by-m lower triangular matrix with real diagonal elements, and Q is an n-by-n orthogonal (or unitary) matrix.

If m > n, the LQ factorization is

$$A = \begin{pmatrix} L_1 \\ L_2 \end{pmatrix} Q$$

where L1 is an n-by-n lower triangular matrix, L2 is rectangular, and Q is an n-by-n orthogonal (or unitary) matrix.

You can use the LQ factorization to find the minimum-norm solution of an underdetermined system of linear equations Ax = b where A is an m-by-n matrix of rank m (m < n).
After factoring the matrix, compute the solution vector x as follows: solve L*y = b for y, and then compute x = (Q1)^H*y.

Table "Computational Routines for Orthogonal Factorization" lists LAPACK routines (FORTRAN 77 interface) that perform orthogonal factorization of matrices. Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

Computational Routines for Orthogonal Factorization

Matrix type, factorization                      Factorize without pivoting   Factorize with pivoting   Generate matrix Q   Apply matrix Q
general matrices, QR factorization              geqrf, geqrfp                geqpf, geqp3              orgqr, ungqr        ormqr, unmqr
general matrices, RQ factorization              gerqf                                                  orgrq, ungrq        ormrq, unmrq
general matrices, LQ factorization              gelqf                                                  orglq, unglq        ormlq, unmlq
general matrices, QL factorization              geqlf                                                  orgql, ungql        ormql, unmql
trapezoidal matrices, RZ factorization          tzrzf                                                                      ormrz, unmrz
pair of matrices, generalized QR factorization  ggqrf
pair of matrices, generalized RQ factorization  ggrqf

?geqrf

Computes the QR factorization of a general m-by-n matrix.

Syntax

Fortran 77:
call sgeqrf(m, n, a, lda, tau, work, lwork, info)
call dgeqrf(m, n, a, lda, tau, work, lwork, info)
call cgeqrf(m, n, a, lda, tau, work, lwork, info)
call zgeqrf(m, n, a, lda, tau, work, lwork, info)

Fortran 95:
call geqrf(a [, tau] [,info])

C:
lapack_int LAPACKE_<?>geqrf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine forms the QR factorization of a general m-by-n matrix A (see Orthogonal Factorizations). No pivoting is performed.

The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.
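As background for the reflector representation mentioned above: in the standard LAPACK convention (taken from general LAPACK documentation, not from this section), each elementary reflector has the form H = I - tau*v*v^T in the real case, so applying it to a vector needs only one dot product and one update. The sketch below shows that operation in plain C; apply_reflector is a hypothetical name, and the code is an illustration of the idea rather than what ?ormqr does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: overwrites x (length m) with H*x, where
 * H = I - tau * v * v^T is a real elementary reflector. */
void apply_reflector(int m, const double *v, double tau, double *x)
{
    assert(m >= 0);
    assert(m == 0 || (v != NULL && x != NULL));

    /* w = v^T * x */
    double w = 0.0;
    for (int i = 0; i < m; ++i) {
        w += v[i] * x[i];
    }
    /* x = x - tau * w * v */
    for (int i = 0; i < m; ++i) {
        x[i] -= tau * w * v[i];
    }
}
```

With v = (1, 1) and tau = 1 the matrix H swaps and negates the two components, so x = (3, 4) becomes (-4, -3) and its 2-norm is unchanged. Applying min(m, n) such reflectors in sequence is how Q (or its transpose) acts on a vector without Q ever being formed.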
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows in the matrix A (m ≥ 0).

n  INTEGER. The number of columns in A (n ≥ 0).

a, work  REAL for sgeqrf
DOUBLE PRECISION for dgeqrf
COMPLEX for cgeqrf
DOUBLE COMPLEX for zgeqrf.
Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda  INTEGER. The leading dimension of a; at least max(1, m).

lwork  INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten by the factorization data as follows:
If m ≥ n, the elements below the diagonal are overwritten by the details of the unitary matrix Q, and the upper triangle is overwritten by the corresponding elements of the upper triangular matrix R.
If m < n, the strictly lower triangular part is overwritten by the details of the unitary matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n upper trapezoidal matrix R.

tau  REAL for sgeqrf
DOUBLE PRECISION for dgeqrf
COMPLEX for cgeqrf
DOUBLE COMPLEX for zgeqrf.
Array, DIMENSION at least max(1, min(m, n)). Contains additional information on the matrix Q.

work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info  INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine geqrf interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,n).

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and supply any admissible lwork value, that is, one no smaller than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed factorization is the exact factorization of a matrix A + E, where ||E||_2 = O(ε)||A||_2.

The approximate number of floating-point operations for real flavors is

(4/3)n^3 if m = n,
(2/3)n^2(3m-n) if m > n,
(2/3)m^2(3n-m) if m < n.

The number of operations for complex flavors is 4 times greater.
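The operation counts above are easy to evaluate when budgeting run time. The following plain C sketch encodes the three cases; the function names geqrf_real_flops and geqrf_complex_flops are hypothetical, and the results are the approximate counts quoted in the text, not measured values.

```c
#include <assert.h>

/* Approximate floating-point operation count for the real flavors of
 * the QR factorization of an m-by-n matrix, per the formulas above. */
double geqrf_real_flops(int m, int n)
{
    assert(m >= 0 && n >= 0);
    double dm = (double)m;
    double dn = (double)n;

    if (m == n) {
        return 4.0 * dn * dn * dn / 3.0;               /* (4/3) n^3        */
    }
    if (m > n) {
        return 2.0 * dn * dn * (3.0 * dm - dn) / 3.0;  /* (2/3) n^2 (3m-n) */
    }
    return 2.0 * dm * dm * (3.0 * dn - dm) / 3.0;      /* (2/3) m^2 (3n-m) */
}

/* Complex flavors need about 4 times as many operations. */
double geqrf_complex_flops(int m, int n)
{
    return 4.0 * geqrf_real_flops(m, n);
}
```

Note that the m > n formula reduces to (4/3)n^3 when m = n, so the three cases join continuously.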
To solve a set of least squares problems minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:

?geqrf (this routine)  to factorize A = QR;
ormqr  to compute C = Q^T*B (for real matrices);
unmqr  to compute C = Q^H*B (for complex matrices);
trsm (a BLAS routine)  to solve R*X = C.

(The columns of the computed X are the least squares solution vectors x.)

To compute the elements of Q explicitly, call

orgqr  (for real matrices)
ungqr  (for complex matrices).

See Also
mkl_progress

?geqrfp

Computes the QR factorization of a general m-by-n matrix with non-negative diagonal elements.

Syntax

Fortran 77:
call sgeqrfp(m, n, a, lda, tau, work, lwork, info)
call dgeqrfp(m, n, a, lda, tau, work, lwork, info)
call cgeqrfp(m, n, a, lda, tau, work, lwork, info)
call zgeqrfp(m, n, a, lda, tau, work, lwork, info)

C:
lapack_int LAPACKE_<?>geqrfp( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h
• C: mkl_lapacke.h

Description

The routine forms the QR factorization of a general m-by-n matrix A (see Orthogonal Factorizations). No pivoting is performed.

The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.

NOTE This routine supports the Progress Routine feature. See Progress Function section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows in the matrix A (m ≥ 0).

n  INTEGER. The number of columns in A (n ≥ 0).

a, work  REAL for sgeqrfp
DOUBLE PRECISION for dgeqrfp
COMPLEX for cgeqrfp
DOUBLE COMPLEX for zgeqrfp.
Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

lda  INTEGER. The leading dimension of a; at least max(1, m).

lwork  INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten by the factorization data as follows:
If m ≥ n, the elements below the diagonal are overwritten by the details of the unitary matrix Q, and the upper triangle is overwritten by the corresponding elements of the upper triangular matrix R.
If m < n, the strictly lower triangular part is overwritten by the details of the unitary matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n upper trapezoidal matrix R.
The diagonal elements of the matrix R are non-negative.

tau  REAL for sgeqrfp
DOUBLE PRECISION for dgeqrfp
COMPLEX for cgeqrfp
DOUBLE COMPLEX for zgeqrfp.
Array, DIMENSION at least max(1, min(m, n)). Contains additional information on the matrix Q.

work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.

info  INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine geqrfp interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,n).

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and supply any admissible lwork value, that is, one no smaller than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed factorization is the exact factorization of a matrix A + E, where ||E||_2 = O(ε)||A||_2.

The approximate number of floating-point operations for real flavors is

(4/3)n^3 if m = n,
(2/3)n^2(3m-n) if m > n,
(2/3)m^2(3n-m) if m < n.

The number of operations for complex flavors is 4 times greater.

To solve a set of least squares problems minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:

?geqrfp (this routine)  to factorize A = QR;
ormqr  to compute C = Q^T*B (for real matrices);
unmqr  to compute C = Q^H*B (for complex matrices);
trsm (a BLAS routine)  to solve R*X = C.

(The columns of the computed X are the least squares solution vectors x.)

To compute the elements of Q explicitly, call

orgqr  (for real matrices)
ungqr  (for complex matrices).
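The last step of the least squares sequence above, solving R*X = C with an upper triangular R, is a back substitution. The sketch below shows that step for a single right-hand side in plain C. It is not the BLAS routine trsm: solve_upper is a hypothetical name, the matrix is assumed to be stored column by column (as in Fortran) with leading dimension ldr, and the nonzero return value for a zero diagonal element is an illustrative convention only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: solves R*x = c for an n-by-n upper triangular R
 * stored column-major with leading dimension ldr. The vector c is
 * overwritten by x. Returns 0 on success, or i (1-based) if r(i,i) == 0. */
int solve_upper(int n, const double *r, int ldr, double *c)
{
    assert(n >= 0 && ldr >= (n > 1 ? n : 1));
    assert(n == 0 || (r != NULL && c != NULL));

    for (int i = n - 1; i >= 0; --i) {
        double s = c[i];
        /* Subtract the contributions of the already computed unknowns. */
        for (int j = i + 1; j < n; ++j) {
            s -= r[i + (size_t)j * ldr] * c[j];
        }
        double d = r[i + (size_t)i * ldr];
        if (d == 0.0) {
            return i + 1;       /* singular: no unique solution */
        }
        c[i] = s / d;
    }
    return 0;
}
```

For R with rows (2, 1) and (0, 4) and c = (4, 8), the sketch returns x = (1, 2). Only the upper triangle of r is read, which matches the layout left in a by the factorization routines, where the entries below the diagonal hold the reflector data.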
See Also
mkl_progress

?geqpf

Computes the QR factorization of a general m-by-n matrix with pivoting.

Syntax

Fortran 77:
call sgeqpf(m, n, a, lda, jpvt, tau, work, info)
call dgeqpf(m, n, a, lda, jpvt, tau, work, info)
call cgeqpf(m, n, a, lda, jpvt, tau, work, rwork, info)
call zgeqpf(m, n, a, lda, jpvt, tau, work, rwork, info)

Fortran 95:
call geqpf(a, jpvt [,tau] [,info])

C:
lapack_int LAPACKE_<?>geqpf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* jpvt, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine is deprecated and has been replaced by routine geqp3.

The routine ?geqpf forms the QR factorization of a general m-by-n matrix A with column pivoting: A*P = Q*R (see Orthogonal Factorizations). Here P denotes an n-by-n permutation matrix.

The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows in the matrix A (m ≥ 0).

n  INTEGER. The number of columns in A (n ≥ 0).

a, work  REAL for sgeqpf
DOUBLE PRECISION for dgeqpf
COMPLEX for cgeqpf
DOUBLE COMPLEX for zgeqpf.
Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work(lwork) is a workspace array. The size of the work array must be at least max(1, 3*n) for real flavors and at least max(1, n) for complex flavors.

lda  INTEGER. The leading dimension of a; at least max(1, m).

jpvt  INTEGER. Array, DIMENSION at least max(1, n).
On entry, if jpvt(i) > 0, the i-th column of A is moved to the beginning of A*P before the computation, and fixed in place during the computation. If jpvt(i) = 0, the i-th column of A is a free column (that is, it may be interchanged during the computation with any other free column).
rwork REAL for cgeqpf, DOUBLE PRECISION for zgeqpf. A workspace array, DIMENSION at least max(1, 2*n).
Output Parameters
a Overwritten by the factorization data as follows:
If m ≥ n, the elements below the diagonal are overwritten by the details of the unitary (orthogonal) matrix Q, and the upper triangle is overwritten by the corresponding elements of the upper triangular matrix R.
If m < n, the strictly lower triangular part is overwritten by the details of the matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n upper trapezoidal matrix R.
tau REAL for sgeqpf, DOUBLE PRECISION for dgeqpf, COMPLEX for cgeqpf, DOUBLE COMPLEX for zgeqpf. Array, DIMENSION at least max(1, min(m, n)). Contains additional information on the matrix Q.
jpvt Overwritten by details of the permutation matrix P in the factorization A*P = Q*R. More precisely, the columns of A*P are the columns of A in the following order: jpvt(1), jpvt(2), ..., jpvt(n).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geqpf interface are the following:
a Holds the matrix A of size (m,n).
jpvt Holds the vector of length n.
tau Holds the vector of length min(m,n).
Application Notes
The computed factorization is the exact factorization of a matrix A + E, where ||E||_2 = O(ε)||A||_2 and ε is the machine precision.
The approximate number of floating-point operations for real flavors is
(4/3)n^3 if m = n,
(2/3)n^2(3m-n) if m > n,
(2/3)m^2(3n-m) if m < n.
The number of operations for complex flavors is 4 times greater.
To solve a set of least squares problems minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:
?geqpf (this routine) to factorize A*P = Q*R;
ormqr to compute C = Q^T*B (for real matrices);
unmqr to compute C = Q^H*B (for complex matrices);
trsm (a BLAS routine) to solve R*X = C.
(The columns of the computed X are the permuted least squares solution vectors x; the output array jpvt specifies the permutation order.)
To compute the elements of Q explicitly, call
orgqr (for real matrices)
ungqr (for complex matrices).
?geqp3
Computes the QR factorization of a general m-by-n matrix with column pivoting using level 3 BLAS.
Syntax
Fortran 77:
call sgeqp3(m, n, a, lda, jpvt, tau, work, lwork, info)
call dgeqp3(m, n, a, lda, jpvt, tau, work, lwork, info)
call cgeqp3(m, n, a, lda, jpvt, tau, work, lwork, rwork, info)
call zgeqp3(m, n, a, lda, jpvt, tau, work, lwork, rwork, info)
Fortran 95:
call geqp3(a, jpvt [,tau] [,info])
C:
lapack_int LAPACKE_<?>geqp3( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* jpvt, <datatype>* tau );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine forms the QR factorization of a general m-by-n matrix A with column pivoting: A*P = Q*R (see Orthogonal Factorizations) using Level 3 BLAS. Here P denotes an n-by-n permutation matrix. Use this routine instead of geqpf for better performance.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.
Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgeqp3, DOUBLE PRECISION for dgeqp3, COMPLEX for cgeqp3, DOUBLE COMPLEX for zgeqp3.
Arrays: a (lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; must be at least max(1, 3*n+1) for real flavors, and at least max(1, n+1) for complex flavors.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes below for details.
jpvt INTEGER. Array, DIMENSION at least max(1, n).
On entry, if jpvt(i) ≠ 0, the i-th column of A is moved to the beginning of A*P before the computation, and fixed in place during the computation. If jpvt(i) = 0, the i-th column of A is a free column (that is, it may be interchanged during the computation with any other free column).
rwork REAL for cgeqp3, DOUBLE PRECISION for zgeqp3. A workspace array, DIMENSION at least max(1, 2*n). Used in complex flavors only.
Output Parameters
a Overwritten by the factorization data as follows:
If m ≥ n, the elements below the diagonal are overwritten by the details of the unitary (orthogonal) matrix Q, and the upper triangle is overwritten by the corresponding elements of the upper triangular matrix R.
If m < n, the strictly lower triangular part is overwritten by the details of the matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n upper trapezoidal matrix R.
tau REAL for sgeqp3, DOUBLE PRECISION for dgeqp3, COMPLEX for cgeqp3, DOUBLE COMPLEX for zgeqp3. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors for the matrix Q.
jpvt Overwritten by details of the permutation matrix P in the factorization A*P = Q*R. More precisely, the columns of A*P are the columns of A in the following order: jpvt(1), jpvt(2), ..., jpvt(n).
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine geqp3 interface are the following:
a Holds the matrix A of size (m,n).
jpvt Holds the vector of length n.
tau Holds the vector of length min(m,n).
Application Notes
To solve a set of least squares problems minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:
?geqp3 (this routine) to factorize A*P = Q*R;
ormqr to compute C = Q^T*B (for real matrices);
unmqr to compute C = Q^H*B (for complex matrices);
trsm (a BLAS routine) to solve R*X = C.
(The columns of the computed X are the permuted least squares solution vectors x; the output array jpvt specifies the permutation order.)
To compute the elements of Q explicitly, call
orgqr (for real matrices)
ungqr (for complex matrices).
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (no less than the minimal value described), the routine completes the task, though possibly more slowly than with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
?orgqr
Generates the real orthogonal matrix Q of the QR factorization formed by ?geqrf.
Syntax
Fortran 77:
call sorgqr(m, n, k, a, lda, tau, work, lwork, info)
call dorgqr(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call orgqr(a, tau [,info])
C:
lapack_int LAPACKE_<?>orgqr( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine generates the whole or part of the m-by-m orthogonal matrix Q of the QR factorization formed by the routines ?geqrf or ?geqpf. Use this routine after a call to sgeqrf/dgeqrf or sgeqpf/dgeqpf.
Usually Q is determined from the QR factorization of an m-by-p matrix A with m ≥ p.
To compute the whole matrix Q, use:
call ?orgqr(m, m, p, a, lda, tau, work, lwork, info)
To compute the leading p columns of Q (which form an orthonormal basis in the space spanned by the columns of A):
call ?orgqr(m, p, p, a, lda, tau, work, lwork, info)
To compute the matrix Qk of the QR factorization of the leading k columns of the matrix A:
call ?orgqr(m, m, k, a, lda, tau, work, lwork, info)
To compute the leading k columns of Qk (which form an orthonormal basis in the space spanned by the leading k columns of the matrix A):
call ?orgqr(m, k, k, a, lda, tau, work, lwork, info)
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The order of the orthogonal matrix Q (m ≥ 0).
n INTEGER. The number of columns of Q to be computed (0 ≤ n ≤ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (0 ≤ k ≤ n).
a, tau, work REAL for sorgqr, DOUBLE PRECISION for dorgqr.
Arrays: a(lda,*) and tau(*) are the arrays returned by sgeqrf / dgeqrf or sgeqpf / dgeqpf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
Output Parameters
a Overwritten by n leading columns of the m-by-m orthogonal matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine orgqr interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).
Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are unsure how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (no less than the minimal value described), the routine completes the task, though possibly more slowly than with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 4*m*n*k - 2*(m + n)*k^2 + (4/3)*k^3.
If n = k, the number is approximately (2/3)*n^2*(3m - n).
The complex counterpart of this routine is ungqr.
?ormqr
Multiplies a real matrix by the orthogonal matrix Q of the QR factorization formed by ?geqrf or ?geqpf.
Syntax
Fortran 77:
call sormqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call ormqr(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>ormqr( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine multiplies a real matrix C by Q or Q^T, where Q is the orthogonal matrix Q of the QR factorization formed by the routines ?geqrf or ?geqpf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result on C).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^T is applied to C from the left.
If side = 'R', Q or Q^T is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'T'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
0 ≤ k ≤ m if side = 'L';
0 ≤ k ≤ n if side = 'R'.
a, tau, c, work REAL for sormqr, DOUBLE PRECISION for dormqr.
Arrays: a(lda,*) and tau(*) are the arrays returned by sgeqrf / dgeqrf or sgeqpf / dgeqpf. The second dimension of a must be at least max(1, k). The dimension of tau must be at least max(1, k).
c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a. Constraints:
lda ≥ max(1, m) if side = 'L';
lda ≥ max(1, n) if side = 'R'.
ldc INTEGER. The leading dimension of c. Constraint: ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints:
lwork ≥ max(1, n) if side = 'L';
lwork ≥ max(1, m) if side = 'R'.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
Output Parameters
c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormqr interface are the following:
a Holds the matrix A of size (r,k). r = m if side = 'L'. r = n if side = 'R'.
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'T'. The default value is 'N'.
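To make the meaning of trans concrete for the side = 'L' case, the following sketch applies Q or Q^T to a matrix C directly from stored reflectors. It is an illustration only: sketch_apply_q_left is a hypothetical helper, not an MKL routine. It assumes the usual compact storage in which Q = H_1*H_2*...*H_k, H_j = I - tau(j)*v_j*v_j^T, and v_j has an implicit unit entry on the diagonal with the rest stored below the diagonal of column j of a. The only difference between the two trans values is the order in which the reflectors are applied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): overwrite the m-by-n matrix c with
 * Q*c (trans = 'N') or Q^T*c (trans = 'T'), covering side = 'L' only.
 * a and c are column-major with leading dimensions lda and ldc (0-based here).
 * The entries on and above the diagonal of a (the R factor) are ignored. */
void sketch_apply_q_left(char trans, int m, int n, int k,
                         const double *a, int lda, const double *tau,
                         double *c, int ldc)
{
    assert(trans == 'N' || trans == 'T');
    assert(0 <= k && k <= m && n >= 0);
    for (int step = 0; step < k; ++step) {
        /* Q^T*c = H_k*...*H_1*c applies H_1 first;
         * Q*c   = H_1*...*H_k*c applies H_k first. */
        int j = (trans == 'T') ? step : k - 1 - step;
        const double *v = a + (size_t)j * lda;
        for (int col = 0; col < n; ++col) {
            double *x = c + (size_t)col * ldc;
            double w = x[j];                       /* implicit v[j] = 1 */
            for (int i = j + 1; i < m; ++i) w += v[i] * x[i];
            w *= tau[j];
            x[j] -= w;
            for (int i = j + 1; i < m; ++i) x[i] -= w * v[i];
        }
    }
}
```

Calling the helper with 'N' and then with 'T' on the same data restores the original C up to rounding, because Q^T*Q = I. The library routine additionally handles side = 'R' and uses a blocked algorithm, which this sketch does not attempt.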
Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are unsure how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (no less than the minimal value described), the routine completes the task, though possibly more slowly than with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is unmqr.
?ungqr
Generates the complex unitary matrix Q of the QR factorization formed by ?geqrf.
Syntax
Fortran 77:
call cungqr(m, n, k, a, lda, tau, work, lwork, info)
call zungqr(m, n, k, a, lda, tau, work, lwork, info)
Fortran 95:
call ungqr(a, tau [,info])
C:
lapack_int LAPACKE_<?>ungqr( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine generates the whole or part of the m-by-m unitary matrix Q of the QR factorization formed by the routines ?geqrf or ?geqpf. Use this routine after a call to cgeqrf/zgeqrf or cgeqpf/zgeqpf.
Usually Q is determined from the QR factorization of an m-by-p matrix A with m ≥ p.
To compute the whole matrix Q, use:
call ?ungqr(m, m, p, a, lda, tau, work, lwork, info)
To compute the leading p columns of Q (which form an orthonormal basis in the space spanned by the columns of A):
call ?ungqr(m, p, p, a, lda, tau, work, lwork, info)
To compute the matrix Qk of the QR factorization of the leading k columns of the matrix A:
call ?ungqr(m, m, k, a, lda, tau, work, lwork, info)
To compute the leading k columns of Qk (which form an orthonormal basis in the space spanned by the leading k columns of the matrix A):
call ?ungqr(m, k, k, a, lda, tau, work, lwork, info)
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The order of the unitary matrix Q (m ≥ 0).
n INTEGER. The number of columns of Q to be computed (0 ≤ n ≤ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (0 ≤ k ≤ n).
a, tau, work COMPLEX for cungqr, DOUBLE COMPLEX for zungqr.
Arrays: a(lda,*) and tau(*) are the arrays returned by cgeqrf/zgeqrf or cgeqpf/zgeqpf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
Output Parameters
a Overwritten by n leading columns of the m-by-m unitary matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ungqr interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).
Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though possibly more slowly than with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 16*m*n*k - 8*(m + n)*k^2 + (16/3)*k^3.
If n = k, the number is approximately (8/3)*n^2*(3m - n).
The real counterpart of this routine is orgqr.
?unmqr
Multiplies a complex matrix by the unitary matrix Q of the QR factorization formed by ?geqrf.
Syntax
Fortran 77:
call cunmqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmqr(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call unmqr(a, tau, c [,side] [,trans] [,info])
C:
lapack_int LAPACKE_<?>unmqr( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine multiplies a rectangular complex matrix C by Q or Q^H, where Q is the unitary matrix Q of the QR factorization formed by the routines ?geqrf or ?geqpf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result on C).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^H is applied to C from the left.
If side = 'R', Q or Q^H is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'C'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
0 ≤ k ≤ m if side = 'L';
0 ≤ k ≤ n if side = 'R'.
a, c, tau, work COMPLEX for cunmqr, DOUBLE COMPLEX for zunmqr.
Arrays: a(lda,*) and tau(*) are the arrays returned by cgeqrf / zgeqrf or cgeqpf / zgeqpf. The second dimension of a must be at least max(1, k). The dimension of tau must be at least max(1, k).
c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a. Constraints:
lda ≥ max(1, m) if side = 'L';
lda ≥ max(1, n) if side = 'R'.
ldc INTEGER. The leading dimension of c. Constraint: ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints:
lwork ≥ max(1, n) if side = 'L';
lwork ≥ max(1, m) if side = 'R'.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
Output Parameters
c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine unmqr interface are the following:
a Holds the matrix A of size (r,k). r = m if side = 'L'. r = n if side = 'R'.
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'C'. The default value is 'N'.
Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though possibly more slowly than with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real counterpart of this routine is ormqr.
?gelqf
Computes the LQ factorization of a general m-by-n matrix.
Syntax
Fortran 77:
call sgelqf(m, n, a, lda, tau, work, lwork, info)
call dgelqf(m, n, a, lda, tau, work, lwork, info)
call cgelqf(m, n, a, lda, tau, work, lwork, info)
call zgelqf(m, n, a, lda, tau, work, lwork, info)
Fortran 95:
call gelqf(a [, tau] [,info])
C:
lapack_int LAPACKE_<?>gelqf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine forms the LQ factorization of a general m-by-n matrix A (see Orthogonal Factorizations). No pivoting is performed.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.
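For orientation, an LQ factorization can be related to a QR factorization of the conjugate-transposed matrix. The identity below is standard linear algebra, not a description of how the routine is implemented; the symbols \tilde{Q} and \tilde{R} are introduced here only for illustration and denote the QR factors of A^H.

```latex
A^{H} = \tilde{Q}\,\tilde{R}
\quad\Longrightarrow\quad
A = \tilde{R}^{H}\,\tilde{Q}^{H} = L\,Q,
\qquad
L = \tilde{R}^{H}\ \text{(lower triangular or trapezoidal)},
\quad
Q = \tilde{Q}^{H}\ \text{(unitary)}.
```

For real matrices the conjugate transpose reduces to the ordinary transpose and Q is orthogonal.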
NOTE This routine supports the Progress Routine feature. See Progress Function section for details.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgelqf, DOUBLE PRECISION for dgelqf, COMPLEX for cgelqf, DOUBLE COMPLEX for zgelqf.
Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, m).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.
Output Parameters
a Overwritten by the factorization data as follows:
If m ≤ n, the elements above the diagonal are overwritten by the details of the unitary (orthogonal) matrix Q, and the lower triangle is overwritten by the corresponding elements of the lower triangular matrix L.
If m > n, the strictly upper triangular part is overwritten by the details of the matrix Q, and the remaining elements are overwritten by the corresponding elements of the m-by-n lower trapezoidal matrix L.
tau REAL for sgelqf, DOUBLE PRECISION for dgelqf, COMPLEX for cgelqf, DOUBLE COMPLEX for zgelqf. Array, DIMENSION at least max(1, min(m, n)). Contains additional information on the matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine gelqf interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length min(m,n).
Application Notes
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are unsure how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (no less than the minimal value described), the routine completes the task, though possibly more slowly than with the recommended workspace, and returns the recommended workspace size in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace size in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed factorization is the exact factorization of a matrix A + E, where ||E||_2 = O(ε)||A||_2 and ε is the machine precision.
The approximate number of floating-point operations for real flavors is
(4/3)n^3 if m = n,
(2/3)n^2(3m-n) if m > n,
(2/3)m^2(3n-m) if m < n.
The number of operations for complex flavors is 4 times greater.
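The piecewise operation count above is easy to evaluate when estimating run time for a given problem size. The helper below is a small sketch; the name sketch_factor_flops and the is_complex flag are hypothetical and introduced only for this illustration. It returns the approximate count quoted in the text, not a measured figure.

```c
#include <assert.h>

/* Approximate floating-point operation count for factorizing an m-by-n
 * matrix, following the piecewise formula quoted in the text.
 * A nonzero is_complex applies the stated factor of 4 for complex flavors. */
double sketch_factor_flops(int m, int n, int is_complex)
{
    assert(m >= 0 && n >= 0);
    double dm = m, dn = n, flops;
    if (m >= n)
        flops = (2.0 / 3.0) * dn * dn * (3.0 * dm - dn);  /* (4/3)n^3 when m = n */
    else
        flops = (2.0 / 3.0) * dm * dm * (3.0 * dn - dm);
    return is_complex ? 4.0 * flops : flops;
}
```

The m = n case needs no separate branch because (2/3)n^2(3n - n) simplifies to (4/3)n^3. For a real 1000-by-500 matrix the estimate is about 4.17e8 operations.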
To find the minimum-norm solution of an underdetermined least squares problem minimizing ||A*x - b||_2 for all columns b of a given matrix B, you can call the following:

?gelqf (this routine)  to factorize A = L*Q;
trsm (a BLAS routine)  to solve L*Y = B for Y;
ormlq  to compute X = (Q1)^T*Y (for real matrices);
unmlq  to compute X = (Q1)^H*Y (for complex matrices).

(The columns of the computed X are the minimum-norm solution vectors x. Here A is an m-by-n matrix with m < n; Q1 denotes the first m rows of Q.)

To compute the elements of Q explicitly, call
orglq  (for real matrices)
unglq  (for complex matrices).

See Also
mkl_progress

?orglq
Generates the real orthogonal matrix Q of the LQ factorization formed by ?gelqf.

Syntax

Fortran 77:
call sorglq(m, n, k, a, lda, tau, work, lwork, info)
call dorglq(m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call orglq(a, tau [,info])

C:
lapack_int LAPACKE_<?>orglq( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine generates the whole or part of the n-by-n orthogonal matrix Q of the LQ factorization formed by the routine ?gelqf. Use this routine after a call to sgelqf/dgelqf.

Usually Q is determined from the LQ factorization of a p-by-n matrix A with n ≥ p.
To compute the whole matrix Q, use:
call ?orglq(n, n, p, a, lda, tau, work, lwork, info)

To compute the leading p rows of Q, which form an orthonormal basis in the space spanned by the rows of A, use:
call ?orglq(p, n, p, a, lda, tau, work, lwork, info)

To compute the matrix Qk of the LQ factorization of the leading k rows of A, use:
call ?orglq(n, n, k, a, lda, tau, work, lwork, info)

To compute the leading k rows of Qk, which form an orthonormal basis in the space spanned by the leading k rows of A, use:
call ?orglq(k, n, k, a, lda, tau, work, lwork, info)

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows of Q to be computed (0 ≤ m ≤ n).
n  INTEGER. The order of the orthogonal matrix Q (n ≥ m).
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q (0 ≤ k ≤ m).
a, tau, work  REAL for sorglq; DOUBLE PRECISION for dorglq. Arrays: a(lda,*) and tau(*) are the arrays returned by sgelqf/dgelqf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The size of the work array; at least max(1, m). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten by m leading rows of the n-by-n orthogonal matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine orglq interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length (k).

Application Notes

For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.

The total number of floating-point operations is approximately 4*m*n*k - 2*(m + n)*k^2 + (4/3)*k^3.
If m = k, the number is approximately (2/3)*m^2*(3n - m).
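As a quick consistency check (a worked step added here, not taken from the manual), substituting k = m into the general operation count recovers the special case quoted for m = k:

```latex
\begin{aligned}
\left. 4mnk - 2(m+n)k^2 + \tfrac{4}{3}k^3 \right|_{k=m}
  &= 4m^2 n - 2m^3 - 2m^2 n + \tfrac{4}{3}m^3 \\
  &= 2m^2 n - \tfrac{2}{3}m^3 \\
  &= \tfrac{2}{3}\,m^2\,(3n - m).
\end{aligned}
```

For instance, m = k = 2 and n = 3 give 48 - 40 + 32/3 = 56/3 from the general expression and (2/3)*4*7 = 56/3 from the special case.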
The complex counterpart of this routine is unglq.

?ormlq
Multiplies a real matrix by the orthogonal matrix Q of the LQ factorization formed by ?gelqf.

Syntax

Fortran 77:
call sormlq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormlq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call ormlq(a, tau, c [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>ormlq( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a real m-by-n matrix C by Q or Q^T, where Q is the orthogonal matrix Q of the LQ factorization formed by the routine ?gelqf.

Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result on C).

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side  CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^T is applied to C from the left. If side = 'R', Q or Q^T is applied to C from the right.
trans  CHARACTER*1. Must be either 'N' or 'T'. If trans = 'N', the routine multiplies C by Q. If trans = 'T', the routine multiplies C by Q^T.
m  INTEGER. The number of rows in the matrix C (m ≥ 0).
n  INTEGER. The number of columns in C (n ≥ 0).
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m if side = 'L'; 0 ≤ k ≤ n if side = 'R'.
a, c, tau, work  REAL for sormlq; DOUBLE PRECISION for dormlq. Arrays: a(lda,*) and tau(*) are arrays returned by ?gelqf.
The second dimension of a must be: at least max(1, m) if side = 'L'; at least max(1, n) if side = 'R'. The dimension of tau must be at least max(1, k). c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc  INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork  INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c  Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine ormlq interface are the following:

a  Holds the matrix A of size (k,m).
tau  Holds the vector of length (k).
c  Holds the matrix C of size (m,n).
side  Must be 'L' or 'R'. The default value is 'L'.
trans  Must be 'N' or 'T'. The default value is 'N'.

Application Notes

For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The complex counterpart of this routine is unmlq.

?unglq
Generates the complex unitary matrix Q of the LQ factorization formed by ?gelqf.

Syntax

Fortran 77:
call cunglq(m, n, k, a, lda, tau, work, lwork, info)
call zunglq(m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call unglq(a, tau [,info])

C:
lapack_int LAPACKE_<?>unglq( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine generates the whole or part of the n-by-n unitary matrix Q of the LQ factorization formed by the routine ?gelqf. Use this routine after a call to cgelqf/zgelqf.

Usually Q is determined from the LQ factorization of a p-by-n matrix A with n ≥ p.
To compute the whole matrix Q, use:
call ?unglq(n, n, p, a, lda, tau, work, lwork, info)

To compute the leading p rows of Q, which form an orthonormal basis in the space spanned by the rows of A, use:
call ?unglq(p, n, p, a, lda, tau, work, lwork, info)

To compute the matrix Qk of the LQ factorization of the leading k rows of the matrix A, use:
call ?unglq(n, n, k, a, lda, tau, work, lwork, info)

To compute the leading k rows of Qk, which form an orthonormal basis in the space spanned by the leading k rows of the matrix A, use:
call ?unglq(k, n, k, a, lda, tau, work, lwork, info)

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows of Q to be computed (0 ≤ m ≤ n).
n  INTEGER. The order of the unitary matrix Q (n ≥ m).
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q (0 ≤ k ≤ m).
a, tau, work  COMPLEX for cunglq; DOUBLE COMPLEX for zunglq. Arrays: a(lda,*) and tau(*) are the arrays returned by cgelqf/zgelqf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The size of the work array; at least max(1, m). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten by m leading rows of the n-by-n unitary matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine unglq interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length (k).

Application Notes

For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε)*||A||_2, where ε is the machine precision.

The total number of floating-point operations is approximately 16*m*n*k - 8*(m + n)*k^2 + (16/3)*k^3.
If m = k, the number is approximately (8/3)*m^2*(3n - m).

The real counterpart of this routine is orglq.

?unmlq
Multiplies a complex matrix by the unitary matrix Q of the LQ factorization formed by ?gelqf.

Syntax

Fortran 77:
call cunmlq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmlq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call unmlq(a, tau, c [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>unmlq( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a complex m-by-n matrix C by Q or Q^H, where Q is the unitary matrix Q of the LQ factorization formed by the routine ?gelqf.

Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result on C).

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side  CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^H is applied to C from the left. If side = 'R', Q or Q^H is applied to C from the right.
trans  CHARACTER*1. Must be either 'N' or 'C'. If trans = 'N', the routine multiplies C by Q. If trans = 'C', the routine multiplies C by Q^H.
m  INTEGER. The number of rows in the matrix C (m ≥ 0).
n  INTEGER. The number of columns in C (n ≥ 0).
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m if side = 'L'; 0 ≤ k ≤ n if side = 'R'.
a, c, tau, work  COMPLEX for cunmlq; DOUBLE COMPLEX for zunmlq. Arrays: a(lda,*) and tau(*) are arrays returned by ?gelqf. The second dimension of a must be: at least max(1, m) if side = 'L'; at least max(1, n) if side = 'R'. The dimension of tau must be at least max(1, k). c(ldc,*) contains the matrix C.
The second dimension of c must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc  INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork  INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c  Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine unmlq interface are the following:

a  Holds the matrix A of size (k,m).
tau  Holds the vector of length (k).
c  Holds the matrix C of size (m,n).
side  Must be 'L' or 'R'. The default value is 'L'.
trans  Must be 'N' or 'C'. The default value is 'N'.

Application Notes

For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The real counterpart of this routine is ormlq.

?geqlf
Computes the QL factorization of a general m-by-n matrix.

Syntax

Fortran 77:
call sgeqlf(m, n, a, lda, tau, work, lwork, info)
call dgeqlf(m, n, a, lda, tau, work, lwork, info)
call cgeqlf(m, n, a, lda, tau, work, lwork, info)
call zgeqlf(m, n, a, lda, tau, work, lwork, info)

Fortran 95:
call geqlf(a [, tau] [,info])

C:
lapack_int LAPACKE_<?>geqlf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine forms the QL factorization of a general m-by-n matrix A. No pivoting is performed.

The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.

NOTE This routine supports the Progress Routine feature. See the Progress Function section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER.
The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A (n ≥ 0).
a, work  REAL for sgeqlf; DOUBLE PRECISION for dgeqlf; COMPLEX for cgeqlf; DOUBLE COMPLEX for zgeqlf. Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The size of the work array; at least max(1, n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten on exit by the factorization data as follows: if m ≥ n, the lower triangle of the subarray a(m-n+1:m, 1:n) contains the n-by-n lower triangular matrix L; if m ≤ n, the elements on and below the (n-m)-th superdiagonal contain the m-by-n lower trapezoidal matrix L; in both cases, the remaining elements, with the array tau, represent the orthogonal/unitary matrix Q as a product of elementary reflectors.
tau  REAL for sgeqlf; DOUBLE PRECISION for dgeqlf; COMPLEX for cgeqlf; DOUBLE COMPLEX for zgeqlf. Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors for the matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine geqlf interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,n).

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

Related routines include:
orgql  to generate matrix Q (for real matrices);
ungql  to generate matrix Q (for complex matrices);
ormql  to apply matrix Q (for real matrices);
unmql  to apply matrix Q (for complex matrices).

See Also
mkl_progress

?orgql
Generates the real matrix Q of the QL factorization formed by ?geqlf.

Syntax

Fortran 77:
call sorgql(m, n, k, a, lda, tau, work, lwork, info)
call dorgql(m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call orgql(a, tau [,info])

C:
lapack_int LAPACKE_<?>orgql( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine generates an m-by-n real matrix Q with orthonormal columns, which is defined as the last n columns of a product of k elementary reflectors H(i) of order m: Q = H(k)*...*H(2)*H(1), as returned by the routine ?geqlf. Use this routine after a call to sgeqlf/dgeqlf.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows of the matrix Q (m ≥ 0).
n  INTEGER. The number of columns of the matrix Q (m ≥ n ≥ 0).
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q (n ≥ k ≥ 0).
a, tau, work  REAL for sorgql; DOUBLE PRECISION for dorgql. Arrays: a(lda,*), tau(*). On entry, the (n - k + i)th column of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by sgeqlf/dgeqlf in the last k columns of its array argument a; tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by sgeqlf/dgeqlf. The second dimension of a must be at least max(1, n). The dimension of tau must be at least max(1, k). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The size of the work array; at least max(1, n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten by the m-by-n matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine orgql interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length (k).

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The complex counterpart of this routine is ungql.

?ungql
Generates the complex matrix Q of the QL factorization formed by ?geqlf.

Syntax

Fortran 77:
call cungql(m, n, k, a, lda, tau, work, lwork, info)
call zungql(m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call ungql(a, tau [,info])

C:
lapack_int LAPACKE_<?>ungql( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files

• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine generates an m-by-n complex matrix Q with orthonormal columns, which is defined as the last n columns of a product of k elementary reflectors H(i) of order m: Q = H(k)*...*H(2)*H(1), as returned by the routine ?geqlf. Use this routine after a call to cgeqlf/zgeqlf.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows of the matrix Q (m ≥ 0).
n  INTEGER. The number of columns of the matrix Q (m ≥ n ≥ 0).
k  INTEGER. The number of elementary reflectors whose product defines the matrix Q (n ≥ k ≥ 0).
a, tau, work  COMPLEX for cungql; DOUBLE COMPLEX for zungql. Arrays: a(lda,*), tau(*), work(lwork). On entry, the (n - k + i)th column of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by cgeqlf/zgeqlf in the last k columns of its array argument a; tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by cgeqlf/zgeqlf. The second dimension of a must be at least max(1, n).
The dimension of tau must be at least max(1, k). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The size of the work array; at least max(1, n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a  Overwritten by the m-by-n matrix Q.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine ungql interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length (k).

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work).
This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The real counterpart of this routine is orgql.

?ormql
Multiplies a real matrix by the orthogonal matrix Q of the QL factorization formed by ?geqlf.

Syntax

Fortran 77:
call sormql(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormql(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call ormql(a, tau, c [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>ormql( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a real m-by-n matrix C by Q or Q^T, where Q is the orthogonal matrix Q of the QL factorization formed by the routine ?geqlf.
Depending on the parameters side and trans, the routine ormql can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result over C).

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^T is applied to C from the left. If side = 'R', Q or Q^T is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'T'. If trans = 'N', the routine multiplies C by Q. If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER.
The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m if side = 'L'; 0 ≤ k ≤ n if side = 'R'.
a, tau, c, work REAL for sormql
DOUBLE PRECISION for dormql.
Arrays: a(lda,*), tau(*), c(ldc,*).
On entry, the i-th column of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by sgeqlf/dgeqlf in the last k columns of its array argument a.
The second dimension of a must be at least max(1, k).
tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by sgeqlf/dgeqlf.
The dimension of tau must be at least max(1, k).
c(ldc,*) contains the m-by-n matrix C.
The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; if side = 'L', lda ≥ max(1, m); if side = 'R', lda ≥ max(1, n).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormql interface are the following:
a Holds the matrix A of size (r,k). r = m if side = 'L'. r = n if side = 'R'.
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'T'. The default value is 'N'.

Application Notes

For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to a value that is less than the minimal required value and is not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The complex counterpart of this routine is unmql.

?unmql
Multiplies a complex matrix by the unitary matrix Q of the QL factorization formed by ?geqlf.
Syntax

Fortran 77:
call cunmql(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmql(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call unmql(a, tau, c [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>unmql( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a complex m-by-n matrix C by Q or Q^H, where Q is the unitary matrix Q of the QL factorization formed by the routine ?geqlf.
Depending on the parameters side and trans, the routine unmql can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result over C).

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^H is applied to C from the left. If side = 'R', Q or Q^H is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'C'. If trans = 'N', the routine multiplies C by Q. If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m if side = 'L'; 0 ≤ k ≤ n if side = 'R'.
a, tau, c, work COMPLEX for cunmql
DOUBLE COMPLEX for zunmql.
Arrays: a(lda,*), tau(*), c(ldc,*), work(lwork).
On entry, the i-th column of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by cgeqlf/zgeqlf in the last k columns of its array argument a.
The second dimension of a must be at least max(1, k).
tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by cgeqlf/zgeqlf.
The dimension of tau must be at least max(1, k).
c(ldc,*) contains the m-by-n matrix C.
The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; if side = 'L', lda ≥ max(1, m); if side = 'R', lda ≥ max(1, n).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine unmql interface are the following:
a Holds the matrix A of size (r,k). r = m if side = 'L'. r = n if side = 'R'.
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'C'. The default value is 'N'.
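
The size rules stated for ?unmql (and for its real counterpart ?ormql) can be collected in a few lines of C. This is an illustrative sketch only: mql_sizes and mql_sizes_for are hypothetical names, not Intel MKL functions, and the helper merely restates the documented constraints on k, ldc, and lwork together with the (r,k) shape of a.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of Intel MKL. It restates the documented
   size rules for a call to ?ormql/?unmql. */
typedef struct {
    int a_rows;    /* r: rows of a that are referenced (m or n)      */
    int a_cols;    /* k: one column per elementary reflector         */
    int min_ldc;   /* smallest legal leading dimension of c          */
    int min_lwork; /* smallest legal lwork (not the recommended one) */
} mql_sizes;

static int max_int(int x, int y) { return x > y ? x : y; }

/* Returns 0 and fills *out when side, m, n, k satisfy the documented
   constraints; returns -1 otherwise. */
int mql_sizes_for(char side, int m, int n, int k, mql_sizes *out)
{
    assert(out != NULL);
    if (side != 'L' && side != 'R') return -1;
    if (m < 0 || n < 0 || k < 0) return -1;

    int r = (side == 'L') ? m : n;  /* order of Q */
    if (k > r) return -1;           /* 0 <= k <= m ('L') or 0 <= k <= n ('R') */

    out->a_rows = r;
    out->a_cols = k;
    out->min_ldc = max_int(1, m);
    out->min_lwork = (side == 'L') ? max_int(1, n) : max_int(1, m);
    return 0;
}
```

For example, with side = 'R', m = 5, n = 3, and k = 2, the helper reports a 3-by-2 array a and a minimal lwork of 5.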

Application Notes

For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The real counterpart of this routine is ormql.

?gerqf
Computes the RQ factorization of a general m-by-n matrix.

Syntax

Fortran 77:
call sgerqf(m, n, a, lda, tau, work, lwork, info)
call dgerqf(m, n, a, lda, tau, work, lwork, info)
call cgerqf(m, n, a, lda, tau, work, lwork, info)
call zgerqf(m, n, a, lda, tau, work, lwork, info)

Fortran 95:
call gerqf(a [, tau] [,info])

C:
lapack_int LAPACKE_<?>gerqf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine forms the RQ factorization of a general m-by-n matrix A. No pivoting is performed.
The routine does not form the matrix Q explicitly. Instead, Q is represented as a product of min(m, n) elementary reflectors. Routines are provided to work with Q in this representation.

NOTE This routine supports the Progress Routine feature.
See Progress Function section for details.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m INTEGER. The number of rows in the matrix A (m ≥ 0).
n INTEGER. The number of columns in A (n ≥ 0).
a, work REAL for sgerqf
DOUBLE PRECISION for dgerqf
COMPLEX for cgerqf
DOUBLE COMPLEX for zgerqf.
Arrays: a(lda,*) contains the m-by-n matrix A.
The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; lwork ≥ max(1, m). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a Overwritten on exit by the factorization data as follows:
if m ≤ n, the upper triangle of the subarray a(1:m, n-m+1:n) contains the m-by-m upper triangular matrix R;
if m ≥ n, the elements on and above the (m-n)-th subdiagonal contain the m-by-n upper trapezoidal matrix R;
in both cases, the remaining elements, with the array tau, represent the orthogonal/unitary matrix Q as a product of min(m,n) elementary reflectors.
tau REAL for sgerqf
DOUBLE PRECISION for dgerqf
COMPLEX for cgerqf
DOUBLE COMPLEX for zgerqf.
Array, DIMENSION at least max(1, min(m, n)). Contains scalar factors of the elementary reflectors for the matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
info INTEGER. If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine gerqf interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length min(m,n).

Application Notes

For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to a value that is less than the minimal required value and is not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

Related routines include:
orgrq to generate matrix Q (for real matrices);
ungrq to generate matrix Q (for complex matrices);
ormrq to apply matrix Q (for real matrices);
unmrq to apply matrix Q (for complex matrices).

See Also
mkl_progress

?orgrq
Generates the real matrix Q of the RQ factorization formed by ?gerqf.
Syntax

Fortran 77:
call sorgrq(m, n, k, a, lda, tau, work, lwork, info)
call dorgrq(m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call orgrq(a, tau [,info])

C:
lapack_int LAPACKE_<?>orgrq( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine generates an m-by-n real matrix Q with orthonormal rows, which is defined as the last m rows of a product of k elementary reflectors H(i) of order n: Q = H(1)*H(2)*...*H(k), as returned by the routine ?gerqf. Use this routine after a call to sgerqf/dgerqf.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m INTEGER. The number of rows of the matrix Q (m ≥ 0).
n INTEGER. The number of columns of the matrix Q (n ≥ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (m ≥ k ≥ 0).
a, tau, work REAL for sorgrq
DOUBLE PRECISION for dorgrq
Arrays: a(lda,*), tau(*).
On entry, the (m - k + i)-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by sgerqf/dgerqf in the last k rows of its array argument a; tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by sgerqf/dgerqf.
The second dimension of a must be at least max(1, n).
The dimension of tau must be at least max(1, k).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, m).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a Overwritten by the m-by-n matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine orgrq interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes

For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
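
The two-call pattern behind a workspace query can be sketched in C as follows. To keep the example self-contained, mock_orgrq is a hypothetical stand-in that only mimics the lwork protocol of ?orgrq (it performs no computation), and MOCK_BLOCKSIZE is an assumed block size rather than a value reported by Intel MKL. In real code both calls would go to sorgrq/dorgrq.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed block size for the mock; a real library picks its own value. */
enum { MOCK_BLOCKSIZE = 32 };

/* Hypothetical stand-in with the argument order of ?orgrq. It reproduces
   only the lwork protocol and does no linear algebra. */
void mock_orgrq(int m, int n, int k, double *a, int lda, const double *tau,
                double *work, int lwork, int *info)
{
    (void)n; (void)k; (void)a; (void)lda; (void)tau;
    assert(work != NULL && info != NULL);

    int min_lwork = (m > 1) ? m : 1;            /* max(1, m)        */
    int opt_lwork = min_lwork * MOCK_BLOCKSIZE; /* m*blocksize rule */

    if (lwork == -1) {          /* workspace query: report size and return */
        work[0] = (double)opt_lwork;
        *info = 0;
        return;
    }
    if (lwork < min_lwork) {    /* too small: error exit, no size hint */
        *info = -8;             /* lwork is the 8th argument */
        return;
    }
    /* A real routine would form Q here. */
    work[0] = (double)opt_lwork;
    *info = 0;
}

/* Query first, then allocate exactly the reported size and call again.
   Returns the lwork that was used, or -1 on failure. */
int run_with_queried_workspace(int m, int n, int k, double *a, int lda,
                               const double *tau)
{
    double query = 0.0;
    int info = 0;

    mock_orgrq(m, n, k, a, lda, tau, &query, -1, &info);
    if (info != 0) return -1;

    int lwork = (int)query;
    double *work = malloc((size_t)lwork * sizeof *work);
    if (work == NULL) return -1;

    mock_orgrq(m, n, k, a, lda, tau, work, lwork, &info);
    free(work);
    return (info == 0) ? lwork : -1;
}
```

The query call needs only a one-element work array, which is why a single double on the stack is enough for the first call.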

Note that if you set lwork to a value that is less than the minimal required value and is not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The complex counterpart of this routine is ungrq.

?ungrq
Generates the complex matrix Q of the RQ factorization formed by ?gerqf.

Syntax

Fortran 77:
call cungrq(m, n, k, a, lda, tau, work, lwork, info)
call zungrq(m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call ungrq(a, tau [,info])

C:
lapack_int LAPACKE_<?>ungrq( int matrix_order, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine generates an m-by-n complex matrix Q with orthonormal rows, which is defined as the last m rows of a product of k elementary reflectors H(i) of order n: Q = H(1)^H*H(2)^H*...*H(k)^H, as returned by the routine ?gerqf. Use this routine after a call to cgerqf/zgerqf.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m INTEGER. The number of rows of the matrix Q (m ≥ 0).
n INTEGER. The number of columns of the matrix Q (n ≥ m).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q (m ≥ k ≥ 0).
a, tau, work COMPLEX for cungrq
DOUBLE COMPLEX for zungrq
Arrays: a(lda,*), tau(*), work(lwork).
On entry, the (m - k + i)-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by cgerqf/zgerqf in the last k rows of its array argument a; tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by cgerqf/zgerqf.
The second dimension of a must be at least max(1, n).
The dimension of tau must be at least max(1, k).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, m).
lwork INTEGER. The size of the work array; at least max(1, m). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

a Overwritten by the m-by-n matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine ungrq interface are the following:
a Holds the matrix A of size (m,n).
tau Holds the vector of length (k).

Application Notes

For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work).
This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The real counterpart of this routine is orgrq.

?ormrq
Multiplies a real matrix by the orthogonal matrix Q of the RQ factorization formed by ?gerqf.

Syntax

Fortran 77:
call sormrq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormrq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call ormrq(a, tau, c [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>ormrq( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a real m-by-n matrix C by Q or Q^T, where Q is the real orthogonal matrix defined as a product of k elementary reflectors H(i): Q = H(1)*H(2)*...*H(k), as returned by the RQ factorization routine ?gerqf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result over C).

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^T is applied to C from the left. If side = 'R', Q or Q^T is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'T'. If trans = 'N', the routine multiplies C by Q. If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER.
The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m, if side = 'L'; 0 ≤ k ≤ n, if side = 'R'.
a, tau, c, work REAL for sormrq
DOUBLE PRECISION for dormrq.
Arrays: a(lda,*), tau(*), c(ldc,*).
On entry, the i-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by sgerqf/dgerqf in the last k rows of its array argument a.
The second dimension of a must be at least max(1, m) if side = 'L', and at least max(1, n) if side = 'R'.
tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by sgerqf/dgerqf.
The dimension of tau must be at least max(1, k).
c(ldc,*) contains the m-by-n matrix C.
The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine ormrq interface are the following:
a Holds the matrix A of size (k,m).
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
trans Must be 'N' or 'T'. The default value is 'N'.

Application Notes

For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to a value that is less than the minimal required value and is not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The complex counterpart of this routine is unmrq.

?unmrq
Multiplies a complex matrix by the unitary matrix Q of the RQ factorization formed by ?gerqf.
Syntax

Fortran 77:
call cunmrq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmrq(side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call unmrq(a, tau, c [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>unmrq( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine multiplies a complex m-by-n matrix C by Q or Q^H, where Q is the complex unitary matrix defined as a product of k elementary reflectors H(i) of order n: Q = H(1)^H*H(2)^H*...*H(k)^H, as returned by the RQ factorization routine ?gerqf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result over C).

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^H is applied to C from the left. If side = 'R', Q or Q^H is applied to C from the right.
trans CHARACTER*1. Must be either 'N' or 'C'. If trans = 'N', the routine multiplies C by Q. If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
k INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints: 0 ≤ k ≤ m, if side = 'L'; 0 ≤ k ≤ n, if side = 'R'.
a, tau, c, work COMPLEX for cunmrq
DOUBLE COMPLEX for zunmrq.
Arrays: a(lda,*), tau(*), c(ldc,*), work(lwork).
On entry, the i-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by cgerqf/zgerqf in the last k rows of its array argument a.
The second dimension of a must be at least max(1, m) if side = 'L', and at least max(1, n) if side = 'R'.
tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by cgerqf/zgerqf.
The dimension of tau must be at least max(1, k).
c(ldc,*) contains the m-by-n matrix C.
The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters

c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine unmrq interface are the following:
a Holds the matrix A of size (k,m).
tau Holds the vector of length (k).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'.
    The default value is 'L'.
trans
    Must be 'N' or 'C'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The real counterpart of this routine is ormrq.

?tzrzf
Reduces the upper trapezoidal matrix A to upper triangular form.

Syntax

Fortran 77:
call stzrzf(m, n, a, lda, tau, work, lwork, info)
call dtzrzf(m, n, a, lda, tau, work, lwork, info)
call ctzrzf(m, n, a, lda, tau, work, lwork, info)
call ztzrzf(m, n, a, lda, tau, work, lwork, info)

Fortran 95:
call tzrzf(a [, tau] [,info])

C:
lapack_int LAPACKE_<?>tzrzf( int matrix_order, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces the m-by-n (m ≤ n) real/complex upper trapezoidal matrix A to upper triangular form by means of orthogonal/unitary transformations.
The upper trapezoidal matrix A is factored as
A = (R 0)*Z,
where Z is an n-by-n orthogonal/unitary matrix and R is an m-by-m upper triangular matrix.
See larz, which applies an elementary reflector returned by ?tzrzf to a general matrix.
The ?tzrzf routine replaces the deprecated ?tzrqf routine.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m
    INTEGER. The number of rows in the matrix A (m ≥ 0).
n
    INTEGER. The number of columns in A (n ≥ m).
a, work
    REAL for stzrzf
    DOUBLE PRECISION for dtzrzf
    COMPLEX for ctzrzf
    DOUBLE COMPLEX for ztzrzf.
    Arrays: a(lda,*), work(lwork).
    The leading m-by-n upper trapezoidal part of the array a contains the matrix A to be factorized.
    The second dimension of a must be at least max(1, n).
    work is a workspace array; its dimension is max(1, lwork).
lda
    INTEGER. The leading dimension of a; at least max(1, m).
lwork
    INTEGER. The size of the work array; lwork ≥ max(1, m).
    If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
    See Application Notes for the suggested value of lwork.

Output Parameters

a
    Overwritten on exit by the factorization data as follows:
    the leading m-by-m upper triangular part of a contains the upper triangular matrix R, and elements m+1 to n of the first m rows of a, with the array tau, represent the orthogonal matrix Z as a product of m elementary reflectors.
tau
    REAL for stzrzf
    DOUBLE PRECISION for dtzrzf
    COMPLEX for ctzrzf
    DOUBLE COMPLEX for ztzrzf.
    Array, DIMENSION at least max(1, m). Contains scalar factors of the elementary reflectors for the matrix Z.
work(1)
    If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info
    INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine tzrzf interface are the following:

a
    Holds the matrix A of size (m,n).
tau
    Holds the vector of length (m).

Application Notes
The factorization is obtained by Householder's method. The k-th transformation matrix, Z(k), which is used to introduce zeros into the (m - k + 1)-th row of A, is given in the form

    Z(k) = ( I    0   )
           ( 0   T(k) )

where for real flavors

    T(k) = I - tau*u(k)*u(k)^T,    u(k) = ( 1, 0, z(k) )^T

and for complex flavors

    T(k) = I - tau*u(k)*u(k)^H,    u(k) = ( 1, 0, z(k) )^T

tau is a scalar and z(k) is an l-element vector. tau and z(k) are chosen to annihilate the elements of the k-th row of X.
The scalar tau is returned in the k-th element of tau and the vector u(k) in the k-th row of A, such that the elements of z(k) are in a(k, m+1), ..., a(k, n). The elements of R are returned in the upper triangular part of A.
Z is given by
Z = Z(1)*Z(2)*...*Z(m).
For better performance, try using lwork = m*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
Related routines include:
ormrz    to apply matrix Q (for real matrices)
unmrz    to apply matrix Q (for complex matrices).

?ormrz
Multiplies a real matrix by the orthogonal matrix defined from the factorization formed by ?tzrzf.

Syntax

Fortran 77:
call sormrz(side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info)
call dormrz(side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call ormrz(a, tau, c, l [, side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>ormrz( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, lapack_int l, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The ?ormrz routine multiplies a real m-by-n matrix C by Q or Q^T, where Q is the real orthogonal matrix defined as a product of k elementary reflectors H(i) of order n: Q = H(1)*H(2)*...*H(k), as returned by the factorization routine ?tzrzf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result over C).
The matrix Q is of order m if side = 'L' and of order n if side = 'R'.
The ?ormrz routine replaces the deprecated ?latzm routine.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

side
    CHARACTER*1. Must be either 'L' or 'R'.
    If side = 'L', Q or Q^T is applied to C from the left.
    If side = 'R', Q or Q^T is applied to C from the right.
trans
    CHARACTER*1. Must be either 'N' or 'T'.
    If trans = 'N', the routine multiplies C by Q.
    If trans = 'T', the routine multiplies C by Q^T.
m
    INTEGER. The number of rows in the matrix C (m ≥ 0).
n
    INTEGER. The number of columns in C (n ≥ 0).
k
    INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
    0 ≤ k ≤ m, if side = 'L';
    0 ≤ k ≤ n, if side = 'R'.
l
    INTEGER. The number of columns of the matrix A containing the meaningful part of the Householder reflectors. Constraints:
    0 ≤ l ≤ m, if side = 'L';
    0 ≤ l ≤ n, if side = 'R'.
a, tau, c, work
    REAL for sormrz
    DOUBLE PRECISION for dormrz.
    Arrays: a(lda,*), tau(*), c(ldc,*), work(lwork).
    On entry, the i-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by stzrzf/dtzrzf in the last k rows of its array argument a.
    The second dimension of a must be at least max(1, m) if side = 'L', and at least max(1, n) if side = 'R'.
    tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by stzrzf/dtzrzf.
    The dimension of tau must be at least max(1, k).
    c(ldc,*) contains the m-by-n matrix C.
    The second dimension of c must be at least max(1, n).
    work is a workspace array; its dimension is max(1, lwork).
lda
    INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc
    INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork
    INTEGER. The size of the work array. Constraints:
    lwork ≥ max(1, n) if side = 'L';
    lwork ≥ max(1, m) if side = 'R'.
    If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
    See Application Notes for the suggested value of lwork.

Output Parameters

c
    Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1)
    If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info
    INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormrz interface are the following:

a
    Holds the matrix A of size (k,m).
tau
    Holds the vector of length (k).
c
    Holds the matrix C of size (m,n).
side
    Must be 'L' or 'R'. The default value is 'L'.
trans
    Must be 'N' or 'T'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (that is, no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The complex counterpart of this routine is unmrz.

?unmrz
Multiplies a complex matrix by the unitary matrix defined from the factorization formed by ?tzrzf.

Syntax

Fortran 77:
call cunmrz(side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info)
call zunmrz(side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call unmrz(a, tau, c, l [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>unmrz( int matrix_order, char side, char trans, lapack_int m, lapack_int n, lapack_int k, lapack_int l, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a complex m-by-n matrix C by Q or Q^H, where Q is the unitary matrix defined as a product of k elementary reflectors H(i):
Q = H(1)^H*H(2)^H*...*H(k)^H, as returned by the factorization routine ?tzrzf.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result over C).
The matrix Q is of order m if side = 'L' and of order n if side = 'R'.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

side
    CHARACTER*1. Must be either 'L' or 'R'.
    If side = 'L', Q or Q^H is applied to C from the left.
    If side = 'R', Q or Q^H is applied to C from the right.
trans
    CHARACTER*1. Must be either 'N' or 'C'.
    If trans = 'N', the routine multiplies C by Q.
    If trans = 'C', the routine multiplies C by Q^H.
m
    INTEGER. The number of rows in the matrix C (m ≥ 0).
n
    INTEGER. The number of columns in C (n ≥ 0).
k
    INTEGER. The number of elementary reflectors whose product defines the matrix Q. Constraints:
    0 ≤ k ≤ m, if side = 'L';
    0 ≤ k ≤ n, if side = 'R'.
l
    INTEGER. The number of columns of the matrix A containing the meaningful part of the Householder reflectors. Constraints:
    0 ≤ l ≤ m, if side = 'L';
    0 ≤ l ≤ n, if side = 'R'.
a, tau, c, work
    COMPLEX for cunmrz
    DOUBLE COMPLEX for zunmrz.
    Arrays: a(lda,*), tau(*), c(ldc,*), work(lwork).
    On entry, the i-th row of a must contain the vector which defines the elementary reflector H(i), for i = 1,2,...,k, as returned by ctzrzf/ztzrzf in the last k rows of its array argument a.
    The second dimension of a must be at least max(1, m) if side = 'L', and at least max(1, n) if side = 'R'.
    tau(i) must contain the scalar factor of the elementary reflector H(i), as returned by ctzrzf/ztzrzf.
    The dimension of tau must be at least max(1, k).
    c(ldc,*) contains the m-by-n matrix C.
    The second dimension of c must be at least max(1, n).
    work is a workspace array; its dimension is max(1, lwork).
lda
    INTEGER. The leading dimension of a; lda ≥ max(1, k).
ldc
    INTEGER. The leading dimension of c; ldc ≥ max(1, m).
lwork
    INTEGER. The size of the work array. Constraints:
    lwork ≥ max(1, n) if side = 'L';
    lwork ≥ max(1, m) if side = 'R'.
    If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
    See Application Notes for the suggested value of lwork.
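
The constraints on the ?unmrz input arguments can be collected into a small pre-call check. The C sketch below is illustrative only: the function unmrz_check_args and the helper max_int are hypothetical names and are not part of Intel MKL. The sketch reads the size constraints as k, l bounded by m (side = 'L') or n (side = 'R'), lda ≥ max(1, k), ldc ≥ max(1, m), and lwork ≥ max(1, n) or max(1, m) unless lwork = -1. It returns the position of the first offending argument in the Fortran 77 calling sequence, or 0 if none is found.

```c
/* Hypothetical helpers for illustration; not Intel MKL routines. */
static int max_int(int a, int b) { return a > b ? a : b; }

/* Checks arguments against the documented ?unmrz constraints.
   Returns 0 if all checks pass, otherwise the 1-based position of the
   first offending argument in the Fortran 77 calling sequence
   (side, trans, m, n, k, l, a, lda, tau, c, ldc, work, lwork, info). */
int unmrz_check_args(char side, char trans, int m, int n, int k, int l,
                     int lda, int ldc, int lwork)
{
    int left  = (side == 'L');
    int right = (side == 'R');
    int nq, nw;

    if (!left && !right) return 1;              /* side must be 'L' or 'R'  */
    if (trans != 'N' && trans != 'C') return 2; /* trans must be 'N' or 'C' */
    if (m < 0) return 3;
    if (n < 0) return 4;

    nq = left ? m : n;   /* order of Q: m for 'L', n for 'R'          */
    nw = left ? n : m;   /* minimum workspace: n for 'L', m for 'R'   */

    if (k < 0 || k > nq) return 5;
    if (l < 0 || l > nq) return 6;
    if (lda < max_int(1, k)) return 8;
    if (ldc < max_int(1, m)) return 11;
    if (lwork != -1 && lwork < max_int(1, nw)) return 13; /* -1 = query */
    return 0;
}
```

For example, unmrz_check_args('L', 'N', 4, 3, 5, 2, 5, 4, 3) returns 5, because k = 5 exceeds m = 4 when side = 'L'. The order of the checks is a choice made for this sketch.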

Output Parameters

c
    Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1)
    If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info
    INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine unmrz interface are the following:

a
    Holds the matrix A of size (k,m).
tau
    Holds the vector of length (k).
c
    Holds the matrix C of size (m,n).
side
    Must be 'L' or 'R'. The default value is 'L'.
trans
    Must be 'N' or 'C'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (if side = 'L') or lwork = m*blocksize (if side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
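
The workspace sizing advice above can be sketched in C as follows. This is a minimal illustration, not an Intel MKL interface: unmrz_suggest_lwork and lwork_from_query are hypothetical names, and blocksize is a value the caller must choose (the notes above mention 16 to 64 as typical). The second helper assumes the caller has already run a workspace query with lwork = -1 and passes in the real part of work(1).

```c
/* Hypothetical helper (not part of Intel MKL): picks an lwork value for
   ?unmrz from the Application Notes formula, never going below the
   minimum size max(1, n) for side = 'L' or max(1, m) for side = 'R'. */
int unmrz_suggest_lwork(char side, int m, int n, int blocksize)
{
    int dim       = (side == 'L') ? n : m;   /* n for 'L', m for 'R'      */
    int min_lwork = dim > 1 ? dim : 1;       /* max(1, dim)               */
    int suggested = dim * blocksize;         /* n*blocksize or m*blocksize */
    return suggested > min_lwork ? suggested : min_lwork;
}

/* Hypothetical helper: converts the value returned in work(1) by a
   workspace query (lwork = -1) into an integer array size. For complex
   flavors, pass the real part of work(1). */
int lwork_from_query(double work1_real)
{
    int lwork = (int)work1_real;
    return lwork > 1 ? lwork : 1;
}
```

With side = 'L', m = 100, n = 50 and an assumed blocksize of 32, unmrz_suggest_lwork returns 1600. A workspace query remains the more reliable way to obtain the recommended size, since the best blocksize is machine-dependent.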
The real counterpart of this routine is ormrz.

?ggqrf
Computes the generalized QR factorization of two matrices.

Syntax

Fortran 77:
call sggqrf(n, m, p, a, lda, taua, b, ldb, taub, work, lwork, info)
call dggqrf(n, m, p, a, lda, taua, b, ldb, taub, work, lwork, info)
call cggqrf(n, m, p, a, lda, taua, b, ldb, taub, work, lwork, info)
call zggqrf(n, m, p, a, lda, taua, b, ldb, taub, work, lwork, info)

Fortran 95:
call ggqrf(a, b [,taua] [,taub] [,info])

C:
lapack_int LAPACKE_<?>ggqrf( int matrix_order, lapack_int n, lapack_int m, lapack_int p, <datatype>* a, lapack_int lda, <datatype>* taua, <datatype>* b, lapack_int ldb, <datatype>* taub );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the generalized QR factorization of an n-by-m matrix A and an n-by-p matrix B as A = Q*R, B = Q*T*Z, where Q is an n-by-n orthogonal/unitary matrix, Z is a p-by-p orthogonal/unitary matrix, and R and T assume one of the forms:

    R = ( R11 )    if n ≥ m,
        (  0  )
or
    R = ( R11  R12 )    if n < m,

where R11 is upper triangular, and

    T = ( 0  T12 )    if n ≤ p,
or
    T = ( T11 )    if n > p,
        ( T21 )

where T12 (n-by-n) or T21 (p-by-p) is an upper triangular matrix.
In particular, if B is square and nonsingular, the GQR factorization of A and B implicitly gives the QR factorization of B^-1*A as:
B^-1*A = Z^T*(T^-1*R) (for real flavors) or B^-1*A = Z^H*(T^-1*R) (for complex flavors).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

n
    INTEGER. The number of rows of the matrices A and B (n ≥ 0).
m
    INTEGER. The number of columns in A (m ≥ 0).
p
    INTEGER. The number of columns in B (p ≥ 0).
a, b, work
    REAL for sggqrf
    DOUBLE PRECISION for dggqrf
    COMPLEX for cggqrf
    DOUBLE COMPLEX for zggqrf.
    Arrays:
    a(lda,*) contains the matrix A.
    The second dimension of a must be at least max(1, m).
    b(ldb,*) contains the matrix B.
    The second dimension of b must be at least max(1, p).
    work is a workspace array; its dimension is max(1, lwork).
lda
    INTEGER. The leading dimension of a; at least max(1, n).
ldb
    INTEGER. The leading dimension of b; at least max(1, n).
lwork
    INTEGER. The size of the work array; must be at least max(1, n, m, p).
    If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
    See Application Notes for the suggested value of lwork.

Output Parameters

a, b
    Overwritten by the factorization data as follows:
    on exit, the elements on and above the diagonal of the array a contain the min(n,m)-by-m upper trapezoidal matrix R (R is upper triangular if n ≥ m); the elements below the diagonal, with the array taua, represent the orthogonal/unitary matrix Q as a product of min(n,m) elementary reflectors;
    if n ≤ p, the upper triangle of the subarray b(1:n, p-n+1:p) contains the n-by-n upper triangular matrix T;
    if n > p, the elements on and above the (n-p)-th subdiagonal contain the n-by-p upper trapezoidal matrix T; the remaining elements, with the array taub, represent the orthogonal/unitary matrix Z as a product of elementary reflectors.
taua, taub
    REAL for sggqrf
    DOUBLE PRECISION for dggqrf
    COMPLEX for cggqrf
    DOUBLE COMPLEX for zggqrf.
    Arrays, DIMENSION at least max(1, min(n, m)) for taua and at least max(1, min(n, p)) for taub.
    The array taua contains the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrix Q.
    The array taub contains the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrix Z.
work(1)
    If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info
    INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggqrf interface are the following:

a
    Holds the matrix A of size (n,m).
b
    Holds the matrix B of size (n,p).
taua
    Holds the vector of length min(n,m).
taub
    Holds the vector of length min(n,p).

Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(1)*H(2)*...*H(k), where k = min(n,m).
Each H(i) has the form
H(i) = I - taua*v*v^T for real flavors, or
H(i) = I - taua*v*v^H for complex flavors,
where taua is a real/complex scalar, and v is a real/complex vector with v(1:i-1) = 0, v(i) = 1.
On exit, v(i+1:n) is stored in a(i+1:n, i) and taua is stored in taua(i).
The matrix Z is represented as a product of elementary reflectors
Z = H(1)*H(2)*...*H(k), where k = min(n,p).
Each H(i) has the form
H(i) = I - taub*v*v^T for real flavors, or
H(i) = I - taub*v*v^H for complex flavors,
where taub is a real/complex scalar, and v is a real/complex vector with v(p-k+i+1:p) = 0, v(p-k+i) = 1.
On exit, v(1:p-k+i-1) is stored in b(n-k+i, 1:p-k+i-1) and taub is stored in taub(i).
For better performance, try using lwork = max(n,m,p)*max(nb1,nb2,nb3), where nb1 is the optimal blocksize for the QR factorization of an n-by-m matrix, nb2 is the optimal blocksize for the RQ factorization of an n-by-p matrix, and nb3 is the optimal blocksize for a call of ormqr/unmqr.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (that is, no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?ggrqf
Computes the generalized RQ factorization of two matrices.

Syntax

Fortran 77:
call sggrqf(m, p, n, a, lda, taua, b, ldb, taub, work, lwork, info)
call dggrqf(m, p, n, a, lda, taua, b, ldb, taub, work, lwork, info)
call cggrqf(m, p, n, a, lda, taua, b, ldb, taub, work, lwork, info)
call zggrqf(m, p, n, a, lda, taua, b, ldb, taub, work, lwork, info)

Fortran 95:
call ggrqf(a, b [,taua] [,taub] [,info])

C:
lapack_int LAPACKE_<?>ggrqf( int matrix_order, lapack_int m, lapack_int p, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* taua, <datatype>* b, lapack_int ldb, <datatype>* taub );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine forms the generalized RQ factorization of an m-by-n matrix A and a p-by-n matrix B as A = R*Q, B = Z*T*Q, where Q is an n-by-n orthogonal/unitary matrix, Z is a p-by-p orthogonal/unitary matrix, and R and T assume one of the forms:

    R = ( 0  R12 )    if m ≤ n,
or
    R = ( R11 )    if m > n,
        ( R21 )

where R12 or R21 is upper triangular, and

    T = ( T11 )    if p ≥ n,
        (  0  )
or
    T = ( T11  T12 )    if p < n,

where T11 is upper triangular.
In particular, if B is square and nonsingular, the GRQ factorization of A and B implicitly gives the RQ factorization of A*B^-1 as:
A*B^-1 = (R*T^-1)*Z^T (for real flavors) or A*B^-1 = (R*T^-1)*Z^H (for complex flavors).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m
    INTEGER. The number of rows of the matrix A (m ≥ 0).
p
    INTEGER. The number of rows in B (p ≥ 0).
n
    INTEGER. The number of columns of the matrices A and B (n ≥ 0).
a, b, work
    REAL for sggrqf
    DOUBLE PRECISION for dggrqf
    COMPLEX for cggrqf
    DOUBLE COMPLEX for zggrqf.
    Arrays:
    a(lda,*) contains the m-by-n matrix A.
    The second dimension of a must be at least max(1, n).
    b(ldb,*) contains the p-by-n matrix B.
    The second dimension of b must be at least max(1, n).
    work is a workspace array; its dimension is max(1, lwork).
lda
    INTEGER. The leading dimension of a; at least max(1, m).
ldb
    INTEGER. The leading dimension of b; at least max(1, p).
lwork
    INTEGER. The size of the work array; must be at least max(1, n, m, p).
    If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
    See Application Notes for the suggested value of lwork.

Output Parameters

a, b
    Overwritten by the factorization data as follows:
    on exit, if m ≤ n, the upper triangle of the subarray a(1:m, n-m+1:n) contains the m-by-m upper triangular matrix R;
    if m > n, the elements on and above the (m-n)-th subdiagonal contain the m-by-n upper trapezoidal matrix R;
    the remaining elements, with the array taua, represent the orthogonal/unitary matrix Q as a product of elementary reflectors;
    the elements on and above the diagonal of the array b contain the min(p,n)-by-n upper trapezoidal matrix T (T is upper triangular if p ≥ n); the elements below the diagonal, with the array taub, represent the orthogonal/unitary matrix Z as a product of elementary reflectors.
taua, taub
    REAL for sggrqf
    DOUBLE PRECISION for dggrqf
    COMPLEX for cggrqf
    DOUBLE COMPLEX for zggrqf.
    Arrays, DIMENSION at least max(1, min(m, n)) for taua and at least max(1, min(p, n)) for taub.
    The array taua contains the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrix Q.
    The array taub contains the scalar factors of the elementary reflectors which represent the orthogonal/unitary matrix Z.
work(1)
    If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info
    INTEGER.
    If info = 0, the execution is successful.
    If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ggrqf interface are the following:

a
    Holds the matrix A of size (m,n).
b
    Holds the matrix B of size (p,n).
taua
    Holds the vector of length min(m,n).
taub
    Holds the vector of length min(p,n).

Application Notes
The matrix Q is represented as a product of elementary reflectors
Q = H(1)*H(2)*...*H(k), where k = min(m,n).
Each H(i) has the form
H(i) = I - taua*v*v^T for real flavors, or
H(i) = I - taua*v*v^H for complex flavors,
where taua is a real/complex scalar, and v is a real/complex vector with v(n-k+i+1:n) = 0, v(n-k+i) = 1.
On exit, v(1:n-k+i-1) is stored in a(m-k+i, 1:n-k+i-1) and taua is stored in taua(i).
The matrix Z is represented as a product of elementary reflectors
Z = H(1)*H(2)*...*H(k), where k = min(p,n).
Each H(i) has the form
H(i) = I - taub*v*v^T for real flavors, or
H(i) = I - taub*v*v^H for complex flavors,
where taub is a real/complex scalar, and v is a real/complex vector with v(1:i-1) = 0, v(i) = 1.
On exit, v(i+1:p) is stored in b(i+1:p, i) and taub is stored in taub(i).
For better performance, try using
lwork = max(n,m,p)*max(nb1,nb2,nb3),
where nb1 is the optimal blocksize for the RQ factorization of an m-by-n matrix, nb2 is the optimal blocksize for the QR factorization of a p-by-n matrix, and nb3 is the optimal blocksize for a call of ?ormrq/?unmrq.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size (that is, no less than the minimal value described), the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
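
The reflector representation of Z described above can be illustrated with a short C sketch for the real flavors. It is not an Intel MKL interface: unpack_z_reflector and apply_reflector are hypothetical names, and the sketch assumes b is stored column-major with leading dimension ldb, as in the Fortran interface. The first function rebuilds the vector v of H(i) from column i of b; the second applies H = I - tau*v*v^T to a vector without forming H.

```c
/* Hypothetical helper: builds the p-element vector v of the i-th
   reflector of Z (i is 1-based) from column-major b, following
   v(1:i-1) = 0, v(i) = 1, v(i+1:p) = b(i+1:p, i). */
void unpack_z_reflector(int p, int i, const double *b, int ldb, double *v)
{
    int r;
    for (r = 1; r <= p; ++r) {
        if (r < i)
            v[r - 1] = 0.0;
        else if (r == i)
            v[r - 1] = 1.0;
        else
            v[r - 1] = b[(r - 1) + (i - 1) * ldb];   /* b(r, i) */
    }
}

/* Hypothetical helper: applies H = I - tau*v*v^T to the n-element
   vector x in place, computed as x := x - tau*(v^T*x)*v.
   Real flavors only; complex flavors use v^H instead of v^T. */
void apply_reflector(int n, double tau, const double *v, double *x)
{
    double dot = 0.0;
    int j;
    for (j = 0; j < n; ++j)
        dot += v[j] * x[j];            /* v^T * x */
    for (j = 0; j < n; ++j)
        x[j] -= tau * dot * v[j];
}
```

Applying the reflectors this way costs O(n) operations per vector, which is why the routines keep Q and Z in factored form instead of forming them explicitly. For production use, the corresponding ?orm* or ?unm* routines are the intended way to apply these matrices.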

Singular Value Decomposition
This section describes LAPACK routines for computing the singular value decomposition (SVD) of a general m-by-n matrix A:
A = U*S*V^H.
In this decomposition, U and V are unitary (for complex A) or orthogonal (for real A); S is an m-by-n diagonal matrix with real diagonal elements s(i):
s(1) ≥ s(2) ≥ ... ≥ s(min(m, n)) ≥ 0.
The diagonal elements s(i) are singular values of A. The first min(m, n) columns of the matrices U and V are, respectively, left and right singular vectors of A. The singular values and singular vectors satisfy
A*v(i) = s(i)*u(i) and A^H*u(i) = s(i)*v(i),
where u(i) and v(i) are the i-th columns of U and V, respectively.
To find the SVD of a general matrix A, call the LAPACK routine ?gebrd or ?gbbrd for reducing A to a bidiagonal matrix B by a unitary (orthogonal) transformation: A = Q*B*P^H. Then call ?bdsqr, which forms the SVD of a bidiagonal matrix: B = U1*S*V1^H.
Thus, the sought-for SVD of A is given by A = U*S*V^H = (Q*U1)*S*(V1^H*P^H).
Table "Computational Routines for Singular Value Decomposition (SVD)" lists LAPACK routines (FORTRAN 77 interface) that perform singular value decomposition of matrices. Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

Computational Routines for Singular Value Decomposition (SVD)

Operation                                                          Real matrices    Complex matrices
Reduce A to a bidiagonal matrix B: A = Q*B*P^H (full storage)      gebrd            gebrd
Reduce A to a bidiagonal matrix B: A = Q*B*P^H (band storage)      gbbrd            gbbrd
Generate the orthogonal (unitary) matrix Q or P                    orgbr            ungbr
Apply the orthogonal (unitary) matrix Q or P                       ormbr            unmbr
Form singular value decomposition of the bidiagonal matrix B:      bdsqr, bdsdc     bdsqr
  B = U*S*V^H

[Figure: Decision Tree: Singular Value Decomposition]

Figure "Decision Tree: Singular Value Decomposition" presents a decision tree that helps you choose the right sequence of routines for SVD, depending on whether you need singular values only or singular vectors as well, whether A is real or complex, and so on.
You can use the SVD to find a minimum-norm solution to a (possibly) rank-deficient least squares problem of minimizing ||A*x - b||_2. The effective rank k of the matrix A can be determined as the number of singular values which exceed a suitable threshold. The minimum-norm solution is
x = V_k*(S_k)^-1*c,
where S_k is the leading k-by-k submatrix of S, the matrix V_k consists of the first k columns of V = P*V1, and the vector c consists of the first k elements of U^H*b = U1^H*Q^H*b.

?gebrd
Reduces a general matrix to bidiagonal form.
Syntax

Fortran 77:
call sgebrd(m, n, a, lda, d, e, tauq, taup, work, lwork, info)
call dgebrd(m, n, a, lda, d, e, tauq, taup, work, lwork, info)
call cgebrd(m, n, a, lda, d, e, tauq, taup, work, lwork, info)
call zgebrd(m, n, a, lda, d, e, tauq, taup, work, lwork, info)

Fortran 95:
call gebrd(a [, d] [,e] [,tauq] [,taup] [,info])

C:
lapack_int LAPACKE_sgebrd( int matrix_order, lapack_int m, lapack_int n, float* a, lapack_int lda, float* d, float* e, float* tauq, float* taup );
lapack_int LAPACKE_dgebrd( int matrix_order, lapack_int m, lapack_int n, double* a, lapack_int lda, double* d, double* e, double* tauq, double* taup );
lapack_int LAPACKE_cgebrd( int matrix_order, lapack_int m, lapack_int n, lapack_complex_float* a, lapack_int lda, float* d, float* e, lapack_complex_float* tauq, lapack_complex_float* taup );
lapack_int LAPACKE_zgebrd( int matrix_order, lapack_int m, lapack_int n, lapack_complex_double* a, lapack_int lda, double* d, double* e, lapack_complex_double* tauq, lapack_complex_double* taup );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces a general m-by-n matrix A to a bidiagonal matrix B by an orthogonal (unitary) transformation.

If m ≥ n, the reduction is given by
$A = Q B P^H = Q \begin{pmatrix} B_1 \\ 0 \end{pmatrix} P^H = Q_1 B_1 P^H$,
where $B_1$ is an n-by-n upper bidiagonal matrix, Q and P are orthogonal or, for a complex A, unitary matrices; $Q_1$ consists of the first n columns of Q.

If m < n, the reduction is given by
$A = Q B P^H = Q \begin{pmatrix} B_1 & 0 \end{pmatrix} P^H = Q_1 B_1 P_1^H$,
where $B_1$ is an m-by-m lower bidiagonal matrix, Q and P are orthogonal or, for a complex A, unitary matrices; $P_1$ consists of the first m rows of P.

The routine does not form the matrices Q and P explicitly, but represents them as products of elementary reflectors. Routines are provided to work with the matrices Q and P in this representation:

If the matrix A is real,
• to compute Q and P explicitly, call orgbr.
• to multiply a general matrix by Q or P, call ormbr.

If the matrix A is complex,
• to compute Q and P explicitly, call ungbr.
• to multiply a general matrix by Q or P, call unmbr.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A (n ≥ 0).
a, work  REAL for sgebrd; DOUBLE PRECISION for dgebrd; COMPLEX for cgebrd; DOUBLE COMPLEX for zgebrd. Arrays: a(lda,*) contains the matrix A. The second dimension of a must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
lwork  INTEGER. The dimension of work; at least max(1, m, n). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a  If m ≥ n, the diagonal and first super-diagonal of a are overwritten by the upper bidiagonal matrix B. Elements below the diagonal are overwritten by details of Q, and the remaining elements are overwritten by details of P. If m < n, the diagonal and first sub-diagonal of a are overwritten by the lower bidiagonal matrix B. Elements above the diagonal are overwritten by details of P, and the remaining elements are overwritten by details of Q.
d  REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Array, DIMENSION at least max(1, min(m, n)). Contains the diagonal elements of B.
e  REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors.
Array, DIMENSION at least max(1, min(m, n) - 1). Contains the off-diagonal elements of B.
tauq, taup  REAL for sgebrd; DOUBLE PRECISION for dgebrd; COMPLEX for cgebrd; DOUBLE COMPLEX for zgebrd. Arrays, DIMENSION at least max(1, min(m, n)). Contain further details of the matrices Q and P.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gebrd interface are the following:

a  Holds the matrix A of size (m,n).
d  Holds the vector of length min(m,n).
e  Holds the vector of length min(m,n)-1.
tauq  Holds the vector of length min(m,n).
taup  Holds the vector of length min(m,n).

Application Notes
For better performance, try using lwork = (m + n)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
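As a rough illustration of the sizing rules above, the following C sketch computes the minimum lwork from the parameter description, max(1, m, n), and the starting guess (m + n)*blocksize from these notes. The function names and the blocksize argument are hypothetical helpers, not part of Intel MKL. The LAPACKE prototypes listed under Syntax take no work or lwork argument, so sizing of this kind only matters when calling the Fortran 77 interface directly.

```c
/* Hypothetical helpers (not MKL functions) that encode the lwork rules
   stated for ?gebrd in this section. */

/* Minimum admissible lwork per the parameter description: max(1, m, n). */
int gebrd_min_lwork(int m, int n)
{
    int lw = 1;
    if (m > lw) lw = m;
    if (n > lw) lw = n;
    return lw;
}

/* Starting guess from the Application Notes: (m + n) * blocksize,
   clamped so it never falls below the minimum admissible value.
   blocksize is a caller-supplied assumption (the text suggests 16 to 64). */
int gebrd_suggested_lwork(int m, int n, int blocksize)
{
    int min_lw = gebrd_min_lwork(m, n);
    int guess = (m + n) * blocksize;
    return (guess > min_lw) ? guess : min_lw;
}
```

The value returned in work(1) by a workspace query remains the authoritative size; the guess above is only a reasonable first allocation.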
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrices Q, B, and P satisfy $Q B P^H = A + E$, where $\|E\|_2 = c(n)\,\varepsilon\,\|A\|_2$, c(n) is a modestly increasing function of n, and $\varepsilon$ is the machine precision.

The approximate number of floating-point operations for real flavors is
$(4/3)\,n^2 (3m - n)$ for m ≥ n,
$(4/3)\,m^2 (3n - m)$ for m < n.
The number of operations for complex flavors is four times greater.

If n is much less than m, it can be more efficient to first form the QR factorization of A by calling geqrf and then reduce the factor R to bidiagonal form. This requires approximately $2 n^2 (m + n)$ floating-point operations.

If m is much less than n, it can be more efficient to first form the LQ factorization of A by calling gelqf and then reduce the factor L to bidiagonal form. This requires approximately $2 m^2 (m + n)$ floating-point operations.

?gbbrd
Reduces a general band matrix to bidiagonal form.
Syntax

Fortran 77:
call sgbbrd(vect, m, n, ncc, kl, ku, ab, ldab, d, e, q, ldq, pt, ldpt, c, ldc, work, info)
call dgbbrd(vect, m, n, ncc, kl, ku, ab, ldab, d, e, q, ldq, pt, ldpt, c, ldc, work, info)
call cgbbrd(vect, m, n, ncc, kl, ku, ab, ldab, d, e, q, ldq, pt, ldpt, c, ldc, work, rwork, info)
call zgbbrd(vect, m, n, ncc, kl, ku, ab, ldab, d, e, q, ldq, pt, ldpt, c, ldc, work, rwork, info)

Fortran 95:
call gbbrd(ab [, c] [,d] [,e] [,q] [,pt] [,kl] [,m] [,info])

C:
lapack_int LAPACKE_sgbbrd( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int ncc, lapack_int kl, lapack_int ku, float* ab, lapack_int ldab, float* d, float* e, float* q, lapack_int ldq, float* pt, lapack_int ldpt, float* c, lapack_int ldc );
lapack_int LAPACKE_dgbbrd( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int ncc, lapack_int kl, lapack_int ku, double* ab, lapack_int ldab, double* d, double* e, double* q, lapack_int ldq, double* pt, lapack_int ldpt, double* c, lapack_int ldc );
lapack_int LAPACKE_cgbbrd( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int ncc, lapack_int kl, lapack_int ku, lapack_complex_float* ab, lapack_int ldab, float* d, float* e, lapack_complex_float* q, lapack_int ldq, lapack_complex_float* pt, lapack_int ldpt, lapack_complex_float* c, lapack_int ldc );
lapack_int LAPACKE_zgbbrd( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int ncc, lapack_int kl, lapack_int ku, lapack_complex_double* ab, lapack_int ldab, double* d, double* e, lapack_complex_double* q, lapack_int ldq, lapack_complex_double* pt, lapack_int ldpt, lapack_complex_double* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces an m-by-n band matrix A to an upper bidiagonal matrix B: $A = Q B P^H$. Here the matrices Q and P are orthogonal (for real A) or unitary (for complex A).
They are determined as products of Givens rotation matrices, and may be formed explicitly by the routine if required. The routine can also update a matrix C as follows: $C = Q^H C$.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

vect  CHARACTER*1. Must be 'N' or 'Q' or 'P' or 'B'. If vect = 'N', neither Q nor P^H is generated. If vect = 'Q', the routine generates the matrix Q. If vect = 'P', the routine generates the matrix P^H. If vect = 'B', the routine generates both Q and P^H.
m  INTEGER. The number of rows in the matrix A (m ≥ 0).
n  INTEGER. The number of columns in A (n ≥ 0).
ncc  INTEGER. The number of columns in C (ncc ≥ 0).
kl  INTEGER. The number of sub-diagonals within the band of A (kl ≥ 0).
ku  INTEGER. The number of super-diagonals within the band of A (ku ≥ 0).
ab, c, work  REAL for sgbbrd; DOUBLE PRECISION for dgbbrd; COMPLEX for cgbbrd; DOUBLE COMPLEX for zgbbrd. Arrays: ab(ldab,*) contains the matrix A in band storage (see Matrix Storage Schemes). The second dimension of ab must be at least max(1, n). c(ldc,*) contains an m-by-ncc matrix C. If ncc = 0, the array c is not referenced. The second dimension of c must be at least max(1, ncc). work(*) is a workspace array. The dimension of work must be at least 2*max(m, n) for real flavors, or max(m, n) for complex flavors.
ldab  INTEGER. The leading dimension of the array ab (ldab ≥ kl + ku + 1).
ldq  INTEGER. The leading dimension of the output array q. ldq ≥ max(1, m) if vect = 'Q' or 'B', ldq ≥ 1 otherwise.
ldpt  INTEGER. The leading dimension of the output array pt. ldpt ≥ max(1, n) if vect = 'P' or 'B', ldpt ≥ 1 otherwise.
ldc  INTEGER. The leading dimension of the array c. ldc ≥ max(1, m) if ncc > 0; ldc ≥ 1 if ncc = 0.
rwork  REAL for cgbbrd; DOUBLE PRECISION for zgbbrd.
A workspace array, DIMENSION at least max(m, n).

Output Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

ab  Overwritten by values generated during the reduction.
d  REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Array, DIMENSION at least max(1, min(m, n)). Contains the diagonal elements of the matrix B.
e  REAL for single-precision flavors; DOUBLE PRECISION for double-precision flavors. Array, DIMENSION at least max(1, min(m, n) - 1). Contains the off-diagonal elements of B.
q, pt  REAL for sgbbrd; DOUBLE PRECISION for dgbbrd; COMPLEX for cgbbrd; DOUBLE COMPLEX for zgbbrd. Arrays: q(ldq,*) contains the output m-by-m matrix Q. The second dimension of q must be at least max(1, m). pt(ldpt,*) contains the output n-by-n matrix P^T. The second dimension of pt must be at least max(1, n).
c  Overwritten by the product $Q^H C$. c is not referenced if ncc = 0.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine gbbrd interface are the following:

ab  Holds the array A of size (kl+ku+1,n).
c  Holds the matrix C of size (m,ncc).
d  Holds the vector with the number of elements min(m,n).
e  Holds the vector with the number of elements min(m,n)-1.
q  Holds the matrix Q of size (m,m).
pt  Holds the matrix PT of size (n,n).
m  If omitted, assumed m = n.
kl  If omitted, assumed kl = ku.
ku  Restored as ku = lda-kl-1.
vect  Restored based on the presence of arguments q and pt as follows:
vect = 'B', if both q and pt are present,
vect = 'Q', if q is present and pt omitted,
vect = 'P', if q is omitted and pt present,
vect = 'N', if both q and pt are omitted.

Application Notes
The computed matrices Q, B, and P satisfy $Q B P^H = A + E$, where $\|E\|_2 = c(n)\,\varepsilon\,\|A\|_2$, c(n) is a modestly increasing function of n, and $\varepsilon$ is the machine precision.

If m = n, the total number of floating-point operations for real flavors is approximately the sum of:
$6 n^2 (kl + ku)$ if vect = 'N' and ncc = 0,
$3 n^2\, ncc\, (kl + ku - 1)/(kl + ku)$ if C is updated, and
$3 n^3 (kl + ku - 1)/(kl + ku)$ if either Q or P^H is generated (double this if both).
To estimate the number of operations for complex flavors, use the same formulas with the coefficients 20 and 10 (instead of 6 and 3).

?orgbr
Generates the real orthogonal matrix Q or P^T determined by ?gebrd.

Syntax

Fortran 77:
call sorgbr(vect, m, n, k, a, lda, tau, work, lwork, info)
call dorgbr(vect, m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call orgbr(a, tau [,vect] [,info])

C:
lapack_int LAPACKE_<?>orgbr( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates the whole or part of the orthogonal matrices Q and P^T formed by the routine gebrd. Use this routine after a call to sgebrd/dgebrd. All valid combinations of arguments are described in Input Parameters. In most cases you need the following:

To compute the whole m-by-m matrix Q:
call ?orgbr('Q', m, m, n, a ... )
(note that the array a must have at least m columns).

To form the n leading columns of Q if m > n:
call ?orgbr('Q', m, n, n, a ... )

To compute the whole n-by-n matrix P^T:
call ?orgbr('P', n, n, m, a ...
)
(note that the array a must have at least n rows).

To form the m leading rows of P^T if m < n:
call ?orgbr('P', m, n, m, a ... )

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

vect  CHARACTER*1. Must be 'Q' or 'P'. If vect = 'Q', the routine generates the matrix Q. If vect = 'P', the routine generates the matrix P^T.
m, n  INTEGER. The number of rows (m) and columns (n) in the matrix Q or P^T to be returned (m ≥ 0, n ≥ 0). If vect = 'Q', m ≥ n ≥ min(m, k). If vect = 'P', n ≥ m ≥ min(n, k).
k  If vect = 'Q', the number of columns in the original m-by-k matrix reduced by gebrd. If vect = 'P', the number of rows in the original k-by-n matrix reduced by gebrd.
a  REAL for sorgbr; DOUBLE PRECISION for dorgbr. The vectors which define the elementary reflectors, as returned by gebrd.
lda  INTEGER. The leading dimension of the array a. lda ≥ max(1, m).
tau  REAL for sorgbr; DOUBLE PRECISION for dorgbr. Array, DIMENSION min(m,k) if vect = 'Q', min(n,k) if vect = 'P'. Scalar factor of the elementary reflector H(i) or G(i), which determines Q and P^T as returned by gebrd in the array tauq or taup.
work  REAL for sorgbr; DOUBLE PRECISION for dorgbr. Workspace array, DIMENSION max(1, lwork).
lwork  INTEGER. Dimension of the array work. See Application Notes for the suggested value of lwork. If lwork = -1, then the routine performs a workspace query and calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.

Output Parameters
a  Overwritten by the orthogonal matrix Q or P^T (or the leading rows or columns thereof) as specified by vect, m, and n.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine orgbr interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,k) where k = m, if vect = 'P', k = n, if vect = 'Q'.
vect  Must be 'Q' or 'P'. The default value is 'Q'.

Application Notes
For better performance, try using lwork = min(m,n)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrix Q differs from an exactly orthogonal matrix by a matrix E such that $\|E\|_2 = O(\varepsilon)$.
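One inexpensive way to sanity-check the orthogonality statement above is to measure how far $Q^T Q$ is from the identity. The sketch below assumes a real, column-major m-by-n array with leading dimension ldq, and reports the largest absolute entry of $Q^T Q - I$. This is a convenient proxy rather than the 2-norm used in the text, and the function name is a hypothetical helper, not an MKL routine.

```c
#include <stddef.h>

/* Hypothetical helper: largest |(Q^T Q - I)(i,j)| for a real m-by-n
   matrix Q stored column-major with leading dimension ldq.
   A value near machine precision suggests the columns are orthonormal. */
double orthogonality_defect(const double *q, int m, int n, int ldq)
{
    double worst = 0.0;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i <= j; ++i) {      /* Q^T Q is symmetric */
            double dot = 0.0;
            for (int r = 0; r < m; ++r)
                dot += q[r + (size_t)i * ldq] * q[r + (size_t)j * ldq];
            double err = dot - ((i == j) ? 1.0 : 0.0);
            if (err < 0.0) err = -err;
            if (err > worst) worst = err;
        }
    }
    return worst;
}
```

For double precision, a result within a small multiple of 1e-16 times the matrix dimension is what one would typically expect from a generated Q.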
The approximate numbers of floating-point operations for the cases listed in Description are as follows:

To form the whole of Q:
$(4/3)\,n (3m^2 - 3mn + n^2)$ if m > n;
$(4/3)\,m^3$ if m ≤ n.

To form the n leading columns of Q when m > n:
$(2/3)\,n^2 (3m - n)$.

To form the whole of P^T:
$(4/3)\,n^3$ if m ≥ n;
$(4/3)\,m (3n^2 - 3mn + m^2)$ if m < n.

To form the m leading rows of P^T when m < n:
$(2/3)\,m^2 (3n - m)$.

The complex counterpart of this routine is ungbr.

?ormbr
Multiplies an arbitrary real matrix by the real orthogonal matrix Q or P^T determined by ?gebrd.

Syntax

Fortran 77:
call sormbr(vect, side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call dormbr(vect, side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call ormbr(a, tau, c [,vect] [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>ormbr( int matrix_order, char vect, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
Given an arbitrary real matrix C, this routine forms one of the matrix products Q*C, Q^T*C, C*Q, C*Q^T, P*C, P^T*C, C*P, C*P^T, where Q and P are orthogonal matrices computed by a call to gebrd. The routine overwrites the product on C.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

In the descriptions below, r denotes the order of Q or P^T: if side = 'L', r = m; if side = 'R', r = n.

vect  CHARACTER*1. Must be 'Q' or 'P'. If vect = 'Q', then Q or Q^T is applied to C. If vect = 'P', then P or P^T is applied to C.
side  CHARACTER*1. Must be 'L' or 'R'.
If side = 'L', multipliers are applied to C from the left. If side = 'R', they are applied to C from the right.
trans  CHARACTER*1. Must be 'N' or 'T'. If trans = 'N', then Q or P is applied to C. If trans = 'T', then Q^T or P^T is applied to C.
m  INTEGER. The number of rows in C.
n  INTEGER. The number of columns in C.
k  INTEGER. One of the dimensions of A in ?gebrd: if vect = 'Q', the number of columns in A; if vect = 'P', the number of rows in A. Constraints: m ≥ 0, n ≥ 0, k ≥ 0.
a, c, work  REAL for sormbr; DOUBLE PRECISION for dormbr. Arrays: a(lda,*) is the array a as returned by ?gebrd. Its second dimension must be at least max(1, min(r,k)) for vect = 'Q', or max(1, r) for vect = 'P'. c(ldc,*) holds the matrix C. Its second dimension must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a. Constraints: lda ≥ max(1, r) if vect = 'Q'; lda ≥ max(1, min(r,k)) if vect = 'P'.
ldc  INTEGER. The leading dimension of c; ldc ≥ max(1, m).
tau  REAL for sormbr; DOUBLE PRECISION for dormbr. Array, DIMENSION at least max(1, min(r, k)). For vect = 'Q', the array tauq as returned by ?gebrd. For vect = 'P', the array taup as returned by ?gebrd.
lwork  INTEGER. The size of the work array. Constraints: lwork ≥ max(1, n) if side = 'L'; lwork ≥ max(1, m) if side = 'R'. If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
c  Overwritten by the product Q*C, Q^T*C, C*Q, C*Q^T, P*C, P^T*C, C*P, or C*P^T, as specified by vect, side, and trans.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER.
If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine ormbr interface are the following:

a  Holds the matrix A of size (r,min(nq,k)) where
r = nq, if vect = 'Q',
r = min(nq,k), if vect = 'P',
nq = m, if side = 'L',
nq = n, if side = 'R',
k = m, if vect = 'P',
k = n, if vect = 'Q'.
tau  Holds the vector of length min(nq,k).
c  Holds the matrix C of size (m,n).
vect  Must be 'Q' or 'P'. The default value is 'Q'.
side  Must be 'L' or 'R'. The default value is 'L'.
trans  Must be 'N' or 'T'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize for side = 'L', or lwork = m*blocksize for side = 'R', where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.

If you choose the first option and set any admissible lwork size, that is, one no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed product differs from the exact product by a matrix E such that $\|E\|_2 = O(\varepsilon)\,\|C\|_2$.

The total number of floating-point operations is approximately
$2nk(2m - k)$ if side = 'L' and m ≥ k;
$2mk(2n - k)$ if side = 'R' and n ≥ k;
$2m^2 n$ if side = 'L' and m < k;
$2n^2 m$ if side = 'R' and n < k.

The complex counterpart of this routine is unmbr.

?ungbr
Generates the complex unitary matrix Q or P^H determined by ?gebrd.

Syntax

Fortran 77:
call cungbr(vect, m, n, k, a, lda, tau, work, lwork, info)
call zungbr(vect, m, n, k, a, lda, tau, work, lwork, info)

Fortran 95:
call ungbr(a, tau [,vect] [,info])

C:
lapack_int LAPACKE_<?>ungbr( int matrix_order, char vect, lapack_int m, lapack_int n, lapack_int k, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine generates the whole or part of the unitary matrices Q and P^H formed by the routine gebrd. Use this routine after a call to cgebrd/zgebrd. All valid combinations of arguments are described in Input Parameters; in most cases you need the following:

To compute the whole m-by-m matrix Q, use:
call ?ungbr('Q', m, m, n, a ... )
(note that the array a must have at least m columns).

To form the n leading columns of Q if m > n, use:
call ?ungbr('Q', m, n, n, a ... )

To compute the whole n-by-n matrix P^H, use:
call ?ungbr('P', n, n, m, a ... )
(note that the array a must have at least n rows).

To form the m leading rows of P^H if m < n, use:
call ?ungbr('P', m, n, m, a ... )

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above.
See the C Interface Conventions section for the C interface principal conventions and type definitions.

vect  CHARACTER*1. Must be 'Q' or 'P'. If vect = 'Q', the routine generates the matrix Q. If vect = 'P', the routine generates the matrix P^H.
m  INTEGER. The number of required rows of Q or P^H.
n  INTEGER. The number of required columns of Q or P^H.
k  INTEGER. One of the dimensions of A in ?gebrd: if vect = 'Q', the number of columns in A; if vect = 'P', the number of rows in A. Constraints: m ≥ 0, n ≥ 0, k ≥ 0. For vect = 'Q': k ≤ n ≤ m if m > k, or m = n if m ≤ k. For vect = 'P': k ≤ m ≤ n if n > k, or m = n if n ≤ k.
a, work  COMPLEX for cungbr; DOUBLE COMPLEX for zungbr. Arrays: a(lda,*) is the array a as returned by ?gebrd. The second dimension of a must be at least max(1, n). work is a workspace array of dimension max(1, lwork).
lda  INTEGER. The leading dimension of a; at least max(1, m).
tau  COMPLEX for cungbr; DOUBLE COMPLEX for zungbr. For vect = 'Q', the array tauq as returned by ?gebrd. For vect = 'P', the array taup as returned by ?gebrd. The dimension of tau must be at least max(1, min(m, k)) for vect = 'Q', or max(1, min(n, k)) for vect = 'P'.
lwork  INTEGER. The size of the work array. Constraint: lwork ≥ max(1, min(m, n)). If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla. See Application Notes for the suggested value of lwork.

Output Parameters
a  Overwritten by the unitary matrix Q or P^H (or the leading rows or columns thereof) as specified by vect, m, and n.
work(1)  If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info  INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
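The dimension constraints on vect, m, n, and k are easy to get wrong, so it can help to check them before the call. The sketch below reads the constraints as m ≥ n ≥ min(m, k) for vect = 'Q' and n ≥ m ≥ min(n, k) for vect = 'P', which is the form stated for ?orgbr and is equivalent to the case-by-case form given here. The function name is a hypothetical helper, not an MKL routine.

```c
/* Hypothetical pre-call check (not an MKL function) for the
   (vect, m, n, k) arguments of ?orgbr / ?ungbr.
   Returns 1 if the documented constraints hold, 0 otherwise. */
int gbr_args_valid(char vect, int m, int n, int k)
{
    if (m < 0 || n < 0 || k < 0)
        return 0;
    if (vect == 'Q') {
        int mk = (m < k) ? m : k;          /* min(m, k) */
        return (m >= n) && (n >= mk);
    }
    if (vect == 'P') {
        int nk = (n < k) ? n : k;          /* min(n, k) */
        return (n >= m) && (m >= nk);
    }
    return 0;                              /* vect must be 'Q' or 'P' */
}
```

For a 5-by-3 matrix reduced by ?gebrd, the four common calls listed under Description correspond to ('Q', 5, 5, 3) and ('Q', 5, 3, 3); for a 3-by-5 matrix, to ('P', 5, 5, 3) and ('P', 3, 5, 3). All four satisfy the check.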
Fortran 95 Interface Notes
Routines in the Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions. Specific details for the routine ungbr interface are the following:

a  Holds the matrix A of size (m,n).
tau  Holds the vector of length min(m,k) where k = m, if vect = 'P', k = n, if vect = 'Q'.
vect  Must be 'Q' or 'P'. The default value is 'Q'.

Application Notes
For better performance, try using lwork = min(m,n)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrix Q differs from an exactly unitary matrix by a matrix E such that $\|E\|_2 = O(\varepsilon)$.

The approximate numbers of floating-point operations are listed below:

To compute the whole matrix Q:
$(16/3)\,n (3m^2 - 3mn + n^2)$ if m > n;
$(16/3)\,m^3$ if m ≤ n.

To form the n leading columns of Q when m > n:
$(8/3)\,n^2 (3m - n)$.

To compute the whole matrix P^H:
$(16/3)\,n^3$ if m ≥ n;
$(16/3)\,m (3n^2 - 3mn + m^2)$ if m < n.

To form the m leading rows of P^H when m < n:
$(8/3)\,m^2 (3n - m)$.
The real counterpart of this routine is orgbr.

?unmbr
Multiplies an arbitrary complex matrix by the unitary matrix Q or P determined by ?gebrd.

Syntax

Fortran 77:
call cunmbr(vect, side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)
call zunmbr(vect, side, trans, m, n, k, a, lda, tau, c, ldc, work, lwork, info)

Fortran 95:
call unmbr(a, tau, c [,vect] [,side] [,trans] [,info])

C:
lapack_int LAPACKE_<?>unmbr( int matrix_order, char vect, char side, char trans, lapack_int m, lapack_int n, lapack_int k, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
Given an arbitrary complex matrix C, this routine forms one of the matrix products Q*C, Q^H*C, C*Q, C*Q^H, P*C, P^H*C, C*P, or C*P^H, where Q and P are unitary matrices computed by a call to gebrd. The routine overwrites the product on C.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

In the descriptions below, r denotes the order of Q or P^H: if side = 'L', r = m; if side = 'R', r = n.

vect  CHARACTER*1. Must be 'Q' or 'P'. If vect = 'Q', then Q or Q^H is applied to C. If vect = 'P', then P or P^H is applied to C.
side  CHARACTER*1. Must be 'L' or 'R'. If side = 'L', multipliers are applied to C from the left. If side = 'R', they are applied to C from the right.
trans  CHARACTER*1. Must be 'N' or 'C'. If trans = 'N', then Q or P is applied to C. If trans = 'C', then Q^H or P^H is applied to C.
m  INTEGER. The number of rows in C.
n  INTEGER. The number of columns in C.
k  INTEGER. One of the dimensions of A in ?gebrd: if vect = 'Q', the number of columns in A; if vect = 'P', the number of rows in A.
        Constraints: m ≥ 0, n ≥ 0, k ≥ 0.
a, c, work
        COMPLEX for cunmbr
        DOUBLE COMPLEX for zunmbr.
        Arrays:
        a(lda,*) is the array a as returned by ?gebrd. Its second dimension must be at least max(1, min(r,k)) for vect = 'Q', or max(1, r) for vect = 'P'.
        c(ldc,*) holds the matrix C. Its second dimension must be at least max(1, n).
        work is a workspace array, its dimension max(1, lwork).
lda     INTEGER. The leading dimension of a. Constraints:
        lda ≥ max(1, r) if vect = 'Q';
        lda ≥ max(1, min(r,k)) if vect = 'P'.
ldc     INTEGER. The leading dimension of c; ldc ≥ max(1, m).
tau     COMPLEX for cunmbr
        DOUBLE COMPLEX for zunmbr.
        Array, DIMENSION at least max(1, min(r, k)).
        For vect = 'Q', the array tauq as returned by ?gebrd.
        For vect = 'P', the array taup as returned by ?gebrd.
lwork   INTEGER. The size of the work array.
        lwork ≥ max(1, n) if side = 'L';
        lwork ≥ max(1, m) if side = 'R'.
        lwork ≥ 1 if n = 0 or m = 0.
        For optimum performance lwork ≥ max(1, n*nb) if side = 'L', and lwork ≥ max(1, m*nb) if side = 'R', where nb is the optimal blocksize. (nb = 0 if m = 0 or n = 0.)
        If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
        See Application Notes for the suggested value of lwork.

Output Parameters

c       Overwritten by the product Q*C, Q^H*C, C*Q, C*Q^H, P*C, P^H*C, C*P, or C*P^H, as specified by vect, side, and trans.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info    INTEGER.
        If info = 0, the execution is successful.
        If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine unmbr interface are the following:

a       Holds the matrix A of size (r,min(nq,k)) where
        r = nq, if vect = 'Q',
        r = min(nq,k), if vect = 'P',
        nq = m, if side = 'L',
        nq = n, if side = 'R',
        k = m, if vect = 'P',
        k = n, if vect = 'Q'.
tau     Holds the vector of length min(nq,k).
c       Holds the matrix C of size (m,n).
vect    Must be 'Q' or 'P'. The default value is 'Q'.
side    Must be 'L' or 'R'. The default value is 'L'.
trans   Must be 'N' or 'C'. The default value is 'N'.

Application Notes

For better performance, use lwork = n*blocksize for side = 'L', or lwork = m*blocksize for side = 'R', where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2.

The total number of floating-point operations is approximately
        8*n*k(2*m - k) if side = 'L' and m ≥ k;
        8*m*k(2*n - k) if side = 'R' and n ≥ k;
        8*m^2*n if side = 'L' and m < k;
        8*n^2*m if side = 'R' and n < k.
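As a rough illustration, the four operation-count cases for ?unmbr can be folded into one small C helper. This is only a sketch: the function name unmbr_flops is hypothetical (it is not part of Intel MKL), and the case conditions are read as m ≥ k versus m < k (and n ≥ k versus n < k) so that the cases cover all inputs.

```c
#include <assert.h>

/* Hypothetical helper, not an Intel MKL routine: approximate number of
   floating-point operations performed by ?unmbr.
   Assumption: the cases split at m >= k (side = 'L') and n >= k (side = 'R'). */
double unmbr_flops(char side, double m, double n, double k)
{
    assert(side == 'L' || side == 'R');
    if (side == 'L') {
        /* Q or P is applied to C from the left. */
        return (m >= k) ? 8.0 * n * k * (2.0 * m - k)
                        : 8.0 * m * m * n;
    }
    /* side == 'R': Q or P is applied to C from the right. */
    return (n >= k) ? 8.0 * m * k * (2.0 * n - k)
                    : 8.0 * n * n * m;
}
```

With this reading the two formulas for each side agree at the boundary (for side = 'L' and m = k both give 8*m^2*n), which is one reason to prefer it.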
The real counterpart of this routine is ormbr.

?bdsqr
Computes the singular value decomposition of a general matrix that has been reduced to bidiagonal form.

Syntax

Fortran 77:
call sbdsqr(uplo, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info)
call dbdsqr(uplo, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info)
call cbdsqr(uplo, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info)
call zbdsqr(uplo, n, ncvt, nru, ncc, d, e, vt, ldvt, u, ldu, c, ldc, work, info)

Fortran 95:
call rbdsqr(d, e [,vt] [,u] [,c] [,uplo] [,info])
call bdsqr(d, e [,vt] [,u] [,c] [,uplo] [,info])

C:
lapack_int LAPACKE_sbdsqr( int matrix_order, char uplo, lapack_int n, lapack_int ncvt, lapack_int nru, lapack_int ncc, float* d, float* e, float* vt, lapack_int ldvt, float* u, lapack_int ldu, float* c, lapack_int ldc );
lapack_int LAPACKE_dbdsqr( int matrix_order, char uplo, lapack_int n, lapack_int ncvt, lapack_int nru, lapack_int ncc, double* d, double* e, double* vt, lapack_int ldvt, double* u, lapack_int ldu, double* c, lapack_int ldc );
lapack_int LAPACKE_cbdsqr( int matrix_order, char uplo, lapack_int n, lapack_int ncvt, lapack_int nru, lapack_int ncc, float* d, float* e, lapack_complex_float* vt, lapack_int ldvt, lapack_complex_float* u, lapack_int ldu, lapack_complex_float* c, lapack_int ldc );
lapack_int LAPACKE_zbdsqr( int matrix_order, char uplo, lapack_int n, lapack_int ncvt, lapack_int nru, lapack_int ncc, double* d, double* e, lapack_complex_double* vt, lapack_int ldvt, lapack_complex_double* u, lapack_int ldu, lapack_complex_double* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the singular values and, optionally, the right and/or left singular vectors from the Singular Value Decomposition (SVD) of a real n-by-n (upper or lower) bidiagonal matrix B using the implicit zero-shift QR algorithm.
The SVD of B has the form B = Q*S*P^H where S is the diagonal matrix of singular values, Q is an orthogonal matrix of left singular vectors, and P is an orthogonal matrix of right singular vectors.

If left singular vectors are requested, this subroutine actually returns U*Q instead of Q, and, if right singular vectors are requested, this subroutine returns P^H*VT instead of P^H, for given real/complex input matrices U and VT. When U and VT are the orthogonal/unitary matrices that reduce a general matrix A to bidiagonal form: A = U*B*VT, as computed by ?gebrd, then

A = (U*Q)*S*(P^H*VT)

is the SVD of A. Optionally, the subroutine may also compute Q^H*C for a given real/complex input matrix C.

See also lasq1, lasq2, lasq3, lasq4, lasq5, lasq6 used by this routine.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo    CHARACTER*1. Must be 'U' or 'L'.
        If uplo = 'U', B is an upper bidiagonal matrix.
        If uplo = 'L', B is a lower bidiagonal matrix.
n       INTEGER. The order of the matrix B (n ≥ 0).
ncvt    INTEGER. The number of columns of the matrix VT, that is, the number of right singular vectors (ncvt ≥ 0).
        Set ncvt = 0 if no right singular vectors are required.
nru     INTEGER. The number of rows in U, that is, the number of left singular vectors (nru ≥ 0).
        Set nru = 0 if no left singular vectors are required.
ncc     INTEGER. The number of columns in the matrix C used for computing the product Q^H*C (ncc ≥ 0).
        Set ncc = 0 if no matrix C is supplied.
d, e, work
        REAL for single-precision flavors
        DOUBLE PRECISION for double-precision flavors.
        Arrays:
        d(*) contains the diagonal elements of B. The dimension of d must be at least max(1, n).
        e(*) contains the (n-1) off-diagonal elements of B.
        The dimension of e must be at least max(1, n).
        work(*) is a workspace array. The dimension of work must be at least max(1, 4*n).
vt, u, c
        REAL for sbdsqr
        DOUBLE PRECISION for dbdsqr
        COMPLEX for cbdsqr
        DOUBLE COMPLEX for zbdsqr.
        Arrays:
        vt(ldvt,*) contains an n-by-ncvt matrix VT. The second dimension of vt must be at least max(1, ncvt). vt is not referenced if ncvt = 0.
        u(ldu,*) contains an nru-by-n unit matrix U. The second dimension of u must be at least max(1, n). u is not referenced if nru = 0.
        c(ldc,*) contains the matrix C for computing the product Q^H*C. The second dimension of c must be at least max(1, ncc). The array is not referenced if ncc = 0.
ldvt    INTEGER. The leading dimension of vt. Constraints:
        ldvt ≥ max(1, n) if ncvt > 0;
        ldvt ≥ 1 if ncvt = 0.
ldu     INTEGER. The leading dimension of u. Constraint: ldu ≥ max(1, nru).
ldc     INTEGER. The leading dimension of c. Constraints:
        ldc ≥ max(1, n) if ncc > 0;
        ldc ≥ 1 otherwise.

Output Parameters

d       On exit, if info = 0, overwritten by the singular values in decreasing order (see info).
e       On exit, if info = 0, e is destroyed. See also info below.
c       Overwritten by the product Q^H*C.
vt      On exit, this array is overwritten by P^H*VT.
u       On exit, this array is overwritten by U*Q.
info    INTEGER.
        If info = 0, the execution is successful.
        If info = -i, the i-th parameter had an illegal value.
        If info > 0 and ncvt = nru = ncc = 0:
        • info = 1, a split was marked by a positive value in e
        • info = 2, the current block of z not diagonalized after 30*n iterations (in the inner while loop)
        • info = 3, termination criterion of the outer while loop is not met (the program created more than n unreduced blocks).
        In all other cases (that is, when ncvt, nru, and ncc are not all zero), the algorithm did not converge; d and e contain the elements of a bidiagonal matrix that is orthogonally similar to the input matrix B; if info = i, i elements of e have not converged to zero.
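The dimension rules above are easy to get wrong when some of ncvt, nru, and ncc are zero, so a small C helper that collects them may be useful. This is only a sketch: the type bdsqr_dims and the function bdsqr_min_dims are hypothetical names (not part of Intel MKL), and the leading-dimension constraints are read as lower bounds.

```c
#include <assert.h>

/* Hypothetical helper type, not part of Intel MKL: the smallest array
   dimensions accepted by ?bdsqr according to the parameter descriptions. */
typedef struct {
    long d_len, e_len, work_len;  /* lengths of d, e, and work        */
    long ldvt, vt_cols;           /* vt(ldvt,*): leading dim, columns */
    long ldu, u_cols;             /* u(ldu,*)                         */
    long ldc, c_cols;             /* c(ldc,*)                         */
} bdsqr_dims;

static long max_long(long a, long b) { return a > b ? a : b; }

bdsqr_dims bdsqr_min_dims(long n, long ncvt, long nru, long ncc)
{
    assert(n >= 0 && ncvt >= 0 && nru >= 0 && ncc >= 0);
    bdsqr_dims s;
    s.d_len    = max_long(1, n);
    s.e_len    = max_long(1, n);
    s.work_len = max_long(1, 4 * n);
    s.ldvt     = (ncvt > 0) ? max_long(1, n) : 1;  /* 1 if vt is unused */
    s.vt_cols  = max_long(1, ncvt);
    s.ldu      = max_long(1, nru);
    s.u_cols   = max_long(1, n);
    s.ldc      = (ncc > 0) ? max_long(1, n) : 1;   /* 1 if c is unused  */
    s.c_cols   = max_long(1, ncc);
    return s;
}
```

For column-major storage, the number of elements to allocate for vt would then be ldvt*vt_cols, and similarly for u and c.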
Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine bdsqr interface are the following:

d       Holds the vector of length (n).
e       Holds the vector of length (n).
vt      Holds the matrix VT of size (n, ncvt).
u       Holds the matrix U of size (nru,n).
c       Holds the matrix C of size (n,ncc).
uplo    Must be 'U' or 'L'. The default value is 'U'.
ncvt    If argument vt is present, then ncvt is equal to the number of columns in matrix VT; otherwise, ncvt is set to zero.
nru     If argument u is present, then nru is equal to the number of rows in matrix U; otherwise, nru is set to zero.
ncc     If argument c is present, then ncc is equal to the number of columns in matrix C; otherwise, ncc is set to zero.

Note that two variants of the Fortran 95 interface for the bdsqr routine are needed because the choice between the real and complex cases becomes ambiguous when vt, u, and c are omitted. Thus, the name rbdsqr is used in real cases (single or double precision), and the name bdsqr is used in complex cases (single or double precision).

Application Notes

Each singular value and singular vector is computed to high relative accuracy. However, the reduction to bidiagonal form (prior to calling the routine) may decrease the relative accuracy in the small singular values of the original matrix if its singular values vary widely in magnitude.

If σ_i is an exact singular value of B, and s_i is the corresponding computed value, then

|s_i - σ_i| ≤ p(m,n)*ε*σ_i

where p(m, n) is a modestly increasing function of m and n, and ε is the machine precision.

If only singular values are computed, they are computed more accurately than when some singular vectors are also computed (that is, the function p(m, n) is smaller).
If u_i is the corresponding exact left singular vector of B, and w_i is the corresponding computed left singular vector, then the angle θ(u_i, w_i) between them is bounded as follows:

θ(u_i, w_i) ≤ p(m,n)*ε / min_{i≠j}(|σ_i - σ_j|/|σ_i + σ_j|).

Here min_{i≠j}(|σ_i - σ_j|/|σ_i + σ_j|) is the relative gap between σ_i and the other singular values. A similar error bound holds for the right singular vectors.

The total number of real floating-point operations is roughly proportional to n^2 if only the singular values are computed. About 6n^2*nru additional operations (12n^2*nru for complex flavors) are required to compute the left singular vectors and about 6n^2*ncvt operations (12n^2*ncvt for complex flavors) to compute the right singular vectors.

?bdsdc
Computes the singular value decomposition of a real bidiagonal matrix using a divide and conquer method.

Syntax

Fortran 77:
call sbdsdc(uplo, compq, n, d, e, u, ldu, vt, ldvt, q, iq, work, iwork, info)
call dbdsdc(uplo, compq, n, d, e, u, ldu, vt, ldvt, q, iq, work, iwork, info)

Fortran 95:
call bdsdc(d, e [,u] [,vt] [,q] [,iq] [,uplo] [,info])

C:
lapack_int LAPACKE_<?>bdsdc( int matrix_order, char uplo, char compq, lapack_int n, <datatype>* d, <datatype>* e, <datatype>* u, lapack_int ldu, <datatype>* vt, lapack_int ldvt, <datatype>* q, lapack_int* iq );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine computes the Singular Value Decomposition (SVD) of a real n-by-n (upper or lower) bidiagonal matrix B: B = U*S*V^T, using a divide and conquer method, where S is a diagonal matrix with non-negative diagonal elements (the singular values of B), and U and V are orthogonal matrices of left and right singular vectors, respectively. ?bdsdc can be used to compute all singular values, and optionally, singular vectors or singular vectors in compact form.

This routine uses ?lasd0, ?lasd1, ?lasd2, ?lasd3, ?lasd4, ?lasd5, ?lasd6, ?lasd7, ?lasd8, ?lasd9, ?lasda, ?lasdq, ?lasdt.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo    CHARACTER*1. Must be 'U' or 'L'.
        If uplo = 'U', B is an upper bidiagonal matrix.
        If uplo = 'L', B is a lower bidiagonal matrix.
compq   CHARACTER*1. Must be 'N', 'P', or 'I'.
        If compq = 'N', compute singular values only.
        If compq = 'P', compute singular values and compute singular vectors in compact form.
        If compq = 'I', compute singular values and singular vectors.
n       INTEGER. The order of the matrix B (n ≥ 0).
d, e, work
        REAL for sbdsdc
        DOUBLE PRECISION for dbdsdc.
        Arrays:
        d(*) contains the n diagonal elements of the bidiagonal matrix B. The dimension of d must be at least max(1, n).
        e(*) contains the off-diagonal elements of the bidiagonal matrix B. The dimension of e must be at least max(1, n).
        work(*) is a workspace array. The dimension of work must be at least:
        max(1, 4*n), if compq = 'N';
        max(1, 6*n), if compq = 'P';
        max(1, 3*n^2 + 4*n), if compq = 'I'.
ldu     INTEGER. The leading dimension of the output array u; ldu ≥ 1.
        If singular vectors are desired, then ldu ≥ max(1, n).
ldvt    INTEGER. The leading dimension of the output array vt; ldvt ≥ 1.
        If singular vectors are desired, then ldvt ≥ max(1, n).
iwork   INTEGER. Workspace array, dimension at least max(1, 8*n).

Output Parameters

d       If info = 0, overwritten by the singular values of B.
e       On exit, e is overwritten.
u, vt, q
        REAL for sbdsdc
        DOUBLE PRECISION for dbdsdc.
        Arrays: u(ldu,*), vt(ldvt,*), q(*).
        If compq = 'I', then on exit u contains the left singular vectors of the bidiagonal matrix B, unless info ≠ 0 (see info). For other values of compq, u is not referenced.
        The second dimension of u must be at least max(1,n).
        If compq = 'I', then on exit vt^T contains the right singular vectors of the bidiagonal matrix B, unless info ≠ 0 (see info). For other values of compq, vt is not referenced. The second dimension of vt must be at least max(1,n).
        If compq = 'P', then on exit, if info = 0, q and iq contain the left and right singular vectors in a compact form. Specifically, q contains all the REAL (for sbdsdc) or DOUBLE PRECISION (for dbdsdc) data for singular vectors. For other values of compq, q is not referenced. See Application Notes for details.
iq      INTEGER.
        Array: iq(*).
        If compq = 'P', then on exit, if info = 0, q and iq contain the left and right singular vectors in a compact form. Specifically, iq contains all the INTEGER data for singular vectors. For other values of compq, iq is not referenced. See Application Notes for details.
info    INTEGER.
        If info = 0, the execution is successful.
        If info = -i, the i-th parameter had an illegal value.
        If info = i, the algorithm failed to compute a singular value. The update process of divide and conquer failed.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine bdsdc interface are the following:

d       Holds the vector of length n.
e       Holds the vector of length n.
u       Holds the matrix U of size (n,n).
vt      Holds the matrix VT of size (n,n).
q       Holds the vector of length (ldq), where
        ldq = n*(11 + 2*smlsiz + 8*int(log_2(n/(smlsiz + 1))))
        and smlsiz is returned by ilaenv and is equal to the maximum size of the subproblems at the bottom of the computation tree (usually about 25).
compq   Restored based on the presence of arguments u, vt, q, and iq as follows:
        compq = 'N', if none of u, vt, q, and iq are present,
        compq = 'I', if both u and vt are present.
        Arguments u and vt must either be both present or both omitted,
        compq = 'P', if both q and iq are present. Arguments q and iq must either be both present or both omitted.
        Note that there will be an error condition if all of u, vt, q, and iq arguments are present simultaneously.

See Also
?lasd0 ?lasd1 ?lasd2 ?lasd3 ?lasd4 ?lasd5 ?lasd6 ?lasd7 ?lasd8 ?lasd9 ?lasda ?lasdq ?lasdt

Symmetric Eigenvalue Problems

Symmetric eigenvalue problems are posed as follows: given an n-by-n real symmetric or complex Hermitian matrix A, find the eigenvalues λ and the corresponding eigenvectors z that satisfy the equation

Az = λz (or, equivalently, z^H*A = λz^H).

In such eigenvalue problems, all n eigenvalues are real not only for real symmetric but also for complex Hermitian matrices A, and there exists an orthonormal system of n eigenvectors. If A is a symmetric or Hermitian positive-definite matrix, all eigenvalues are positive.

To solve a symmetric eigenvalue problem with LAPACK, you usually need to reduce the matrix to tridiagonal form and then solve the eigenvalue problem with the tridiagonal matrix obtained. LAPACK includes routines for reducing the matrix to a tridiagonal form by an orthogonal (or unitary) similarity transformation A = Q*T*Q^H as well as for solving tridiagonal symmetric eigenvalue problems. These routines (for FORTRAN 77 interface) are listed in Table "Computational Routines for Solving Symmetric Eigenvalue Problems". Respective routine names in Fortran 95 interface are without the first symbol (see Routine Naming Conventions).

There are different routines for symmetric eigenvalue problems, depending on whether you need all eigenvectors or only some of them or eigenvalues only, whether the matrix A is positive-definite or not, and so on.
These routines are based on three primary algorithms for computing eigenvalues and eigenvectors of symmetric problems: the divide and conquer algorithm, the QR algorithm, and bisection followed by inverse iteration. The divide and conquer algorithm is generally more efficient and is recommended for computing all eigenvalues and eigenvectors. Furthermore, to solve an eigenvalue problem using the divide and conquer algorithm, you need to call only one routine. In general, more than one routine has to be called if the QR algorithm or bisection followed by inverse iteration is used.

The decision tree in Figure "Decision Tree: Real Symmetric Eigenvalue Problems" will help you choose the right routine or sequence of routines for eigenvalue problems with real symmetric matrices. Figure "Decision Tree: Complex Hermitian Eigenvalue Problems" presents a similar decision tree for complex Hermitian matrices.

[Figure: Decision Tree: Real Symmetric Eigenvalue Problems]

[Figure: Decision Tree: Complex Hermitian Eigenvalue Problems]

Computational Routines for Solving Symmetric Eigenvalue Problems

Operation                                                      Real symmetric    Complex Hermitian
                                                               matrices          matrices
Reduce to tridiagonal form A = Q*T*Q^H (full storage)          sytrd, syrdb      hetrd, herdb
Reduce to tridiagonal form A = Q*T*Q^H (packed storage)        sptrd             hptrd
Reduce to tridiagonal form A = Q*T*Q^H (band storage)          sbtrd             hbtrd
Generate matrix Q (full storage)                               orgtr             ungtr
Generate matrix Q (packed storage)                             opgtr             upgtr
Apply matrix Q (full storage)                                  ormtr             unmtr
Apply matrix Q (packed storage)                                opmtr             upmtr
Find all eigenvalues of a tridiagonal matrix T                 sterf
Find all eigenvalues and eigenvectors of a tridiagonal
  matrix T                                                     steqr, stedc      steqr, stedc
Find all eigenvalues and eigenvectors of a tridiagonal
  positive-definite matrix T                                   pteqr             pteqr
Find selected eigenvalues of a tridiagonal matrix T            stebz, stegr      stegr
Find selected eigenvectors of a tridiagonal matrix T           stein, stegr      stein, stegr
Find selected eigenvalues and eigenvectors of a real
  symmetric tridiagonal matrix T                               stemr             stemr
Compute the reciprocal condition numbers for the
  eigenvectors                                                 disna             disna

?sytrd
Reduces a real symmetric matrix to tridiagonal form.

Syntax

Fortran 77:
call ssytrd(uplo, n, a, lda, d, e, tau, work, lwork, info)
call dsytrd(uplo, n, a, lda, d, e, tau, work, lwork, info)

Fortran 95:
call sytrd(a, tau [,uplo] [,info])

C:
lapack_int LAPACKE_<?>sytrd( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, <datatype>* d, <datatype>* e, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description

The routine reduces a real symmetric matrix A to symmetric tridiagonal form T by an orthogonal similarity transformation: A = Q*T*Q^T. The orthogonal matrix Q is not formed explicitly but is represented as a product of n-1 elementary reflectors. Routines are provided for working with Q in this representation (see Application Notes below).

This routine calls latrd to reduce a real symmetric matrix to tridiagonal form by an orthogonal similarity transformation.

Input Parameters

The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

uplo    CHARACTER*1. Must be 'U' or 'L'.
        If uplo = 'U', a stores the upper triangular part of A.
        If uplo = 'L', a stores the lower triangular part of A.
n       INTEGER. The order of the matrix A (n ≥ 0).
a, work REAL for ssytrd
        DOUBLE PRECISION for dsytrd.
        a(lda,*) is an array containing either upper or lower triangular part of the matrix A, as specified by uplo.
        If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced.
        If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced.
        The second dimension of a must be at least max(1, n).
        work is a workspace array, its dimension max(1, lwork).
lda     INTEGER. The leading dimension of a; at least max(1, n).
lwork   INTEGER. The size of the work array (lwork ≥ n).
        If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
        See Application Notes for the suggested value of lwork.

Output Parameters

a       On exit,
        if uplo = 'U', the diagonal and first superdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements above the first superdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors;
        if uplo = 'L', the diagonal and first subdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements below the first subdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors.
d, e, tau
        REAL for ssytrd
        DOUBLE PRECISION for dsytrd.
        Arrays:
        d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
        e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
        tau(*) stores further details of the orthogonal matrix Q in the first n-1 elements. tau(n) is used as workspace. The dimension of tau must be at least max(1, n).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info    INTEGER.
        If info = 0, the execution is successful.
        If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes

Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.

Specific details for the routine sytrd interface are the following:

a       Holds the matrix A of size (n,n).
tau     Holds the vector of length (n-1).
uplo    Must be 'U' or 'L'. The default value is 'U'.

Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.

Application Notes

For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.

In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

The computed matrix T is exactly similar to a matrix A+E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision.

The approximate number of floating-point operations is (4/3)n^3.
After calling this routine, you can call the following:

orgtr   to form the computed matrix Q explicitly
ormtr   to multiply a real matrix by Q.

The complex counterpart of this routine is hetrd.

?syrdb
Reduces a real symmetric matrix to tridiagonal form with Successive Bandwidth Reduction approach.

Syntax

Fortran 77:
call ssyrdb(jobz, uplo, n, kd, a, lda, d, e, tau, z, ldz, work, lwork, info)
call dsyrdb(jobz, uplo, n, kd, a, lda, d, e, tau, z, ldz, work, lwork, info)

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description

The routine reduces a real symmetric matrix A to symmetric tridiagonal form T by an orthogonal similarity transformation: A = Q*T*Q^T and optionally multiplies matrix Z by Q, or simply forms Q.

This routine reduces a full symmetric matrix to the banded symmetric form, and then to the tridiagonal symmetric form with a Successive Bandwidth Reduction approach after Prof. C. Bischof's works (see for instance, [Bischof92]). ?syrdb is functionally close to the ?sytrd routine, but the tridiagonal form may differ from that obtained by ?sytrd. Unlike ?sytrd, the orthogonal matrix Q cannot be restored from the details of matrix A on exit.

Input Parameters

jobz    CHARACTER*1. Must be 'N', 'V', or 'U'.
        If jobz = 'N', then only A is reduced to T.
        If jobz = 'V', then A is reduced to T and A contains Q on exit.
        If jobz = 'U', then A is reduced to T and Z contains Z*Q on exit.
uplo    CHARACTER*1. Must be 'U' or 'L'.
        If uplo = 'U', a stores the upper triangular part of A.
        If uplo = 'L', a stores the lower triangular part of A.
n       INTEGER. The order of the matrix A (n ≥ 0).
kd      INTEGER. The bandwidth of the banded matrix B (kd ≥ 1).
a, z, work
        REAL for ssyrdb.
        DOUBLE PRECISION for dsyrdb.
        a(lda,*) is an array containing either upper or lower triangular part of the matrix A, as specified by uplo.
        The second dimension of a must be at least max(1, n).
        z(ldz,*), the second dimension of z must be at least max(1, n).
        If jobz = 'U', then the matrix z is multiplied by Q.
        If jobz = 'N' or 'V', then z is not referenced.
        work(lwork) is a workspace array.
lda     INTEGER. The leading dimension of a; at least max(1, n).
ldz     INTEGER. The leading dimension of z; at least max(1, n). Not referenced if jobz = 'N'.
lwork   INTEGER. The size of the work array (lwork ≥ (2kd+1)n+kd).
        If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
        See Application Notes for the suggested value of lwork.

Output Parameters

a       If jobz = 'V', then overwritten by Q matrix.
        If jobz = 'N' or 'U', then overwritten by the banded matrix B and details of the orthogonal matrix QB to reduce A to B as specified by uplo.
z       On exit,
        if jobz = 'U', then the matrix z is overwritten by Z*Q.
        If jobz = 'N' or 'V', then z is not referenced.
d, e, tau
        REAL for ssyrdb.
        DOUBLE PRECISION for dsyrdb.
        Arrays:
        d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
        e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
        tau(*) stores further details of the orthogonal matrix Q. The dimension of tau must be at least max(1, n-kd-1).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info    INTEGER.
        If info = 0, the execution is successful.
        If info = -i, the i-th parameter had an illegal value.

Application Notes

For better performance, try using lwork = n*(3*kd+3).

If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the corresponding array work on exit. Use this value (work(1)) for subsequent runs.

If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work). This operation is called a workspace query.

Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

For better performance, try using kd equal to 40 if n ≤ 2000 and 64 otherwise.

Try using ?syrdb instead of ?sytrd on large matrices when only eigenvalues are needed and no eigenvectors, especially in a multi-threaded environment. ?syrdb becomes faster beginning approximately with n = 1000, and much faster at larger matrices with a better scalability than ?sytrd.

Avoid applying ?syrdb for computing eigenvectors due to the two-step reduction, that is, the number of operations needed to apply orthogonal transformations to Z is doubled compared to the traditional one-step reduction. In that case it is better to apply ?sytrd and ?ormtr/?orgtr to obtain tridiagonal form along with the orthogonal transformation matrix Q.

?herdb
Reduces a complex Hermitian matrix to tridiagonal form with Successive Bandwidth Reduction approach.

Syntax

Fortran 77:
call cherdb(jobz, uplo, n, kd, a, lda, d, e, tau, z, ldz, work, lwork, info)
call zherdb(jobz, uplo, n, kd, a, lda, d, e, tau, z, ldz, work, lwork, info)

Include Files
• FORTRAN 77: mkl_lapack.fi and mkl_lapack.h

Description

The routine reduces a complex Hermitian matrix A to symmetric tridiagonal form T by a unitary similarity transformation: A = Q*T*Q^H and optionally multiplies matrix Z by Q, or simply forms Q.
This routine reduces a full Hermitian matrix to the banded Hermitian form, and then to the tridiagonal symmetric form with a Successive Bandwidth Reduction approach after Prof. C. Bischof's works (see, for instance, [Bischof92]). ?herdb is functionally close to the ?hetrd routine, but the tridiagonal form may differ from that obtained by ?hetrd. Unlike ?hetrd, the unitary matrix Q cannot be restored from the details of matrix A on exit.

Input Parameters
jobz CHARACTER*1. Must be 'N', 'V', or 'U'.
If jobz = 'N', then only A is reduced to T.
If jobz = 'V', then A is reduced to T and A contains Q on exit.
If jobz = 'U', then A is reduced to T and Z contains Z*Q on exit.
uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', a stores the upper triangular part of A.
If uplo = 'L', a stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
kd INTEGER. The bandwidth of the banded matrix B (kd ≥ 1).
a, z, work COMPLEX for cherdb.
DOUBLE COMPLEX for zherdb.
a(lda,*) is an array containing either the upper or lower triangular part of the matrix A, as specified by uplo.
The second dimension of a must be at least max(1, n).
z(ldz,*), the second dimension of z must be at least max(1, n).
If jobz = 'U', then the matrix z is multiplied by Q.
If jobz = 'N' or 'V', then z is not referenced.
work(lwork) is a workspace array.
lda INTEGER. The leading dimension of a; at least max(1, n).
ldz INTEGER. The leading dimension of z; at least max(1, n). Not referenced if jobz = 'N'.
lwork INTEGER. The size of the work array (lwork ≥ (2*kd+1)*n+kd).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
a If jobz = 'V', then overwritten by the matrix Q.
If jobz = 'N' or 'U', then overwritten by the banded matrix B and details of the unitary matrix Q_B that reduces A to B, as specified by uplo.
z On exit, if jobz = 'U', then the matrix z is overwritten by Z*Q.
If jobz = 'N' or 'V', then z is not referenced.
d, e COMPLEX for cherdb.
DOUBLE COMPLEX for zherdb.
Arrays:
d(*) contains the diagonal elements of the matrix T.
The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T.
The dimension of e must be at least max(1, n-1).
tau COMPLEX for cherdb.
DOUBLE COMPLEX for zherdb.
Array, DIMENSION at least max(1, n-1).
Stores further details of the unitary matrix Q_B.
The dimension of tau must be at least max(1, n-kd-1).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Application Notes
For better performance, try using lwork = n*(3*kd+3).
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
For better performance, try using kd equal to 40 if n ≤ 2000 and 64 otherwise.
Try using ?herdb instead of ?hetrd on large matrices when only eigenvalues are needed and no eigenvectors are required, especially in a multi-threaded environment. ?herdb becomes faster beginning approximately with n = 1000, and much faster on larger matrices, with better scalability than ?hetrd.
Avoid applying ?herdb for computing eigenvectors: because of the two-step reduction, the number of operations needed to apply unitary transformations to Z is doubled compared to the traditional one-step reduction. In that case it is better to apply ?hetrd and ?unmtr/?ungtr to obtain the tridiagonal form along with the unitary transformation matrix Q.

?orgtr
Generates the real orthogonal matrix Q determined by ?sytrd.

Syntax
Fortran 77:
call sorgtr(uplo, n, a, lda, tau, work, lwork, info)
call dorgtr(uplo, n, a, lda, tau, work, lwork, info)
Fortran 95:
call orgtr(a, tau [,uplo] [,info])
C:
lapack_int LAPACKE_<?>orgtr( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine explicitly generates the n-by-n orthogonal matrix Q formed by ?sytrd when reducing a real symmetric matrix A to tridiagonal form: A = Q*T*Q^T. Use this routine after a call to ?sytrd.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Use the same uplo as supplied to ?sytrd.
n INTEGER. The order of the matrix Q (n ≥ 0).
a, tau, work REAL for sorgtr
DOUBLE PRECISION for dorgtr.
Arrays:
a(lda,*) is the array a as returned by ?sytrd.
The second dimension of a must be at least max(1, n).
tau(*) is the array tau as returned by ?sytrd.
The dimension of tau must be at least max(1, n-1).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by the orthogonal matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine orgtr interface are the following:
a Holds the matrix A of size (n,n).
tau Holds the vector of length (n-1).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For better performance, try using lwork = (n-1)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed matrix Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of floating-point operations is (4/3)n^3.
The complex counterpart of this routine is ?ungtr.

?ormtr
Multiplies a real matrix by the real orthogonal matrix Q determined by ?sytrd.

Syntax
Fortran 77:
call sormtr(side, uplo, trans, m, n, a, lda, tau, c, ldc, work, lwork, info)
call dormtr(side, uplo, trans, m, n, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call ormtr(a, tau, c [,side] [,uplo] [,trans] [,info])
C:
lapack_int LAPACKE_<?>ormtr( int matrix_order, char side, char uplo, char trans, lapack_int m, lapack_int n, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a real matrix C by Q or Q^T, where Q is the orthogonal matrix formed by ?sytrd when reducing a real symmetric matrix A to tridiagonal form: A = Q*T*Q^T. Use this routine after a call to ?sytrd.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result on C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
In the descriptions below, r denotes the order of Q:
If side = 'L', r = m; if side = 'R', r = n.
side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^T is applied to C from the left.
If side = 'R', Q or Q^T is applied to C from the right.
uplo CHARACTER*1. Must be 'U' or 'L'.
Use the same uplo as supplied to ?sytrd.
trans CHARACTER*1. Must be either 'N' or 'T'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'T', the routine multiplies C by Q^T.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
a, c, tau, work REAL for sormtr
DOUBLE PRECISION for dormtr.
a(lda,*) and tau are the arrays returned by ?sytrd.
The second dimension of a must be at least max(1, r).
The dimension of tau must be at least max(1, r-1).
c(ldc,*) contains the matrix C.
The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, r).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, n).
lwork INTEGER. The size of the work array. Constraints:
lwork ≥ max(1, n) if side = 'L';
lwork ≥ max(1, m) if side = 'R'.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance.
Use this lwork for subsequent runs.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ormtr interface are the following:
a Holds the matrix A of size (r,r).
r = m if side = 'L'.
r = n if side = 'R'.
tau Holds the vector of length (r-1).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N' or 'T'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize for side = 'L', or lwork = m*blocksize for side = 'R', where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2.
The total number of floating-point operations is approximately 2*m^2*n if side = 'L', or 2*n^2*m if side = 'R'.
The complex counterpart of this routine is ?unmtr.

?hetrd
Reduces a complex Hermitian matrix to tridiagonal form.

Syntax
Fortran 77:
call chetrd(uplo, n, a, lda, d, e, tau, work, lwork, info)
call zhetrd(uplo, n, a, lda, d, e, tau, work, lwork, info)
Fortran 95:
call hetrd(a, tau [,uplo] [,info])
C:
lapack_int LAPACKE_chetrd( int matrix_order, char uplo, lapack_int n, lapack_complex_float* a, lapack_int lda, float* d, float* e, lapack_complex_float* tau );
lapack_int LAPACKE_zhetrd( int matrix_order, char uplo, lapack_int n, lapack_complex_double* a, lapack_int lda, double* d, double* e, lapack_complex_double* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces a complex Hermitian matrix A to symmetric tridiagonal form T by a unitary similarity transformation: A = Q*T*Q^H. The unitary matrix Q is not formed explicitly but is represented as a product of n-1 elementary reflectors. Routines are provided to work with Q in this representation. (They are described later in this section.)
This routine calls ?latrd to reduce a complex Hermitian matrix A to Hermitian tridiagonal form by a unitary similarity transformation.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', a stores the upper triangular part of A.
If uplo = 'L', a stores the lower triangular part of A.
n INTEGER. The order of the matrix A (n ≥ 0).
a, work COMPLEX for chetrd
DOUBLE COMPLEX for zhetrd.
a(lda,*) is an array containing either the upper or lower triangular part of the matrix A, as specified by uplo. If uplo = 'U', the leading n-by-n upper triangular part of a contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced. If uplo = 'L', the leading n-by-n lower triangular part of a contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced.
The second dimension of a must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
a On exit,
if uplo = 'U', the diagonal and first superdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements above the first superdiagonal, with the array tau, represent the unitary matrix Q as a product of elementary reflectors;
if uplo = 'L', the diagonal and first subdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements below the first subdiagonal, with the array tau, represent the unitary matrix Q as a product of elementary reflectors.
d, e REAL for chetrd
DOUBLE PRECISION for zhetrd.
Arrays:
d(*) contains the diagonal elements of the matrix T.
The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T.
The dimension of e must be at least max(1, n-1).
tau COMPLEX for chetrd
DOUBLE COMPLEX for zhetrd.
Array, DIMENSION at least max(1, n-1). Stores further details of the unitary matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hetrd interface are the following:
a Holds the matrix A of size (n,n).
tau Holds the vector of length (n-1).
uplo Must be 'U' or 'L'. The default value is 'U'.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.

Application Notes
For better performance, try using lwork = n*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If you are in doubt how much workspace to supply, use a generous value of lwork for the first run or set lwork = -1.
If you choose the first option and set any admissible lwork size, which is no less than the minimal value described, the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If you set lwork = -1, the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if you set lwork to less than the minimal required value and not -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed matrix T is exactly similar to a matrix A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision.
The approximate number of floating-point operations is (16/3)n^3.
After calling this routine, you can call the following:
?ungtr to form the computed matrix Q explicitly
?unmtr to multiply a complex matrix by Q.
The real counterpart of this routine is ?sytrd.

?ungtr
Generates the complex unitary matrix Q determined by ?hetrd.

Syntax
Fortran 77:
call cungtr(uplo, n, a, lda, tau, work, lwork, info)
call zungtr(uplo, n, a, lda, tau, work, lwork, info)
Fortran 95:
call ungtr(a, tau [,uplo] [,info])
C:
lapack_int LAPACKE_<?>ungtr( int matrix_order, char uplo, lapack_int n, <datatype>* a, lapack_int lda, const <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine explicitly generates the n-by-n unitary matrix Q formed by ?hetrd when reducing a complex Hermitian matrix A to tridiagonal form: A = Q*T*Q^H. Use this routine after a call to ?hetrd.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
Use the same uplo as supplied to ?hetrd.
n INTEGER. The order of the matrix Q (n ≥ 0).
a, tau, work COMPLEX for cungtr
DOUBLE COMPLEX for zungtr.
Arrays:
a(lda,*) is the array a as returned by ?hetrd.
The second dimension of a must be at least max(1, n).
tau(*) is the array tau as returned by ?hetrd.
The dimension of tau must be at least max(1, n-1).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; at least max(1, n).
lwork INTEGER. The size of the work array (lwork ≥ n).
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
a Overwritten by the unitary matrix Q.
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine ungtr interface are the following:
a Holds the matrix A of size (n,n).
tau Holds the vector of length (n-1).
uplo Must be 'U' or 'L'. The default value is 'U'.

Application Notes
For better performance, try using lwork = (n-1)*blocksize, where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed matrix Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of floating-point operations is (16/3)n^3.
The real counterpart of this routine is ?orgtr.

?unmtr
Multiplies a complex matrix by the complex unitary matrix Q determined by ?hetrd.

Syntax
Fortran 77:
call cunmtr(side, uplo, trans, m, n, a, lda, tau, c, ldc, work, lwork, info)
call zunmtr(side, uplo, trans, m, n, a, lda, tau, c, ldc, work, lwork, info)
Fortran 95:
call unmtr(a, tau, c [,side] [,uplo] [,trans] [,info])
C:
lapack_int LAPACKE_<?>unmtr( int matrix_order, char side, char uplo, char trans, lapack_int m, lapack_int n, const <datatype>* a, lapack_int lda, const <datatype>* tau, <datatype>* c, lapack_int ldc );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine multiplies a complex matrix C by Q or Q^H, where Q is the unitary matrix formed by ?hetrd when reducing a complex Hermitian matrix A to tridiagonal form: A = Q*T*Q^H. Use this routine after a call to ?hetrd.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result on C).

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
In the descriptions below, r denotes the order of Q:
If side = 'L', r = m; if side = 'R', r = n.
side CHARACTER*1. Must be either 'L' or 'R'.
If side = 'L', Q or Q^H is applied to C from the left.
If side = 'R', Q or Q^H is applied to C from the right.
uplo CHARACTER*1. Must be 'U' or 'L'.
Use the same uplo as supplied to ?hetrd.
trans CHARACTER*1. Must be either 'N' or 'C'.
If trans = 'N', the routine multiplies C by Q.
If trans = 'C', the routine multiplies C by Q^H.
m INTEGER. The number of rows in the matrix C (m ≥ 0).
n INTEGER. The number of columns in C (n ≥ 0).
a, c, tau, work COMPLEX for cunmtr
DOUBLE COMPLEX for zunmtr.
a(lda,*) and tau are the arrays returned by ?hetrd.
The second dimension of a must be at least max(1, r).
The dimension of tau must be at least max(1, r-1).
c(ldc,*) contains the matrix C.
The second dimension of c must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).
lda INTEGER. The leading dimension of a; lda ≥ max(1, r).
ldc INTEGER. The leading dimension of c; ldc ≥ max(1, n).
lwork INTEGER. The size of the work array. Constraints:
lwork ≥ max(1, n) if side = 'L';
lwork ≥ max(1, m) if side = 'R'.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.
See Application Notes for the suggested value of lwork.

Output Parameters
c Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
work(1) If info = 0, on exit work(1) contains the minimum value of lwork required for optimum performance. Use this lwork for subsequent runs.
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine unmtr interface are the following:
a Holds the matrix A of size (r,r).
r = m if side = 'L'.
r = n if side = 'R'.
tau Holds the vector of length (r-1).
c Holds the matrix C of size (m,n).
side Must be 'L' or 'R'. The default value is 'L'.
uplo Must be 'U' or 'L'. The default value is 'U'.
trans Must be 'N' or 'C'. The default value is 'N'.

Application Notes
For better performance, try using lwork = n*blocksize (for side = 'L') or lwork = m*blocksize (for side = 'R'), where blocksize is a machine-dependent value (typically, 16 to 64) required for optimum performance of the blocked algorithm.
If it is not clear how much workspace to supply, use a generous value of lwork for the first run, or set lwork = -1.
In the first case the routine completes the task, though probably not as fast as with the recommended workspace, and provides the recommended workspace in the first element of the array work on exit. Use this value (work(1)) for subsequent runs.
If lwork = -1, then the routine returns immediately and provides the recommended workspace in the first element of the array work. This operation is called a workspace query.
Note that if lwork is less than the minimal required value and is not equal to -1, then the routine returns immediately with an error exit and does not provide any information on the recommended workspace.
The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 8*m^2*n if side = 'L' or 8*n^2*m if side = 'R'.
The real counterpart of this routine is ?ormtr.

?sptrd
Reduces a real symmetric matrix to tridiagonal form using packed storage.

Syntax
Fortran 77:
call ssptrd(uplo, n, ap, d, e, tau, info)
call dsptrd(uplo, n, ap, d, e, tau, info)
Fortran 95:
call sptrd(ap, tau [,uplo] [,info])
C:
lapack_int LAPACKE_<?>sptrd( int matrix_order, char uplo, lapack_int n, <datatype>* ap, <datatype>* d, <datatype>* e, <datatype>* tau );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine reduces a packed real symmetric matrix A to symmetric tridiagonal form T by an orthogonal similarity transformation: A = Q*T*Q^T. The orthogonal matrix Q is not formed explicitly but is represented as a product of n-1 elementary reflectors. Routines are provided for working with Q in this representation. See Application Notes below for details.

Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo CHARACTER*1. Must be 'U' or 'L'.
If uplo = 'U', ap stores the packed upper triangle of A.
If uplo = 'L', ap stores the packed lower triangle of A.
n INTEGER. The order of the matrix A (n ≥ 0).
ap REAL for ssptrd
DOUBLE PRECISION for dsptrd.
Array, DIMENSION at least max(1, n(n+1)/2). Contains either the upper or lower triangle of A (as specified by uplo) in the packed form described in "Matrix Arguments" in Appendix B.

Output Parameters
ap Overwritten by the tridiagonal matrix T and details of the orthogonal matrix Q, as specified by uplo.
d, e, tau REAL for ssptrd
DOUBLE PRECISION for dsptrd.
Arrays:
d(*) contains the diagonal elements of the matrix T.
The dimension of d must be at least max(1, n).
e(*) contains the off-diagonal elements of T.
The dimension of e must be at least max(1, n-1).
tau(*) stores further details of the matrix Q.
The dimension of tau must be at least max(1, n-1).
info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sptrd interface are the following:
ap Holds the array A of size (n*(n+1)/2).
tau Holds the vector with the number of elements n-1.
uplo Must be 'U' or 'L'. The default value is 'U'.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.

Application Notes
The matrix Q is represented as a product of n-1 elementary reflectors, as follows:
• If uplo = 'U', Q = H(n-1) ... H(2)H(1)
Each H(i) has the form
H(i) = I - tau*v*v^T for real flavors, or
H(i) = I - tau*v*v^H for complex flavors,
where tau is a real/complex scalar, and v is a real/complex vector with v(i+1:n) = 0 and v(i) = 1.
On exit, tau is stored in tau(i), and v(1:i-1) is stored in AP, overwriting A(1:i-1, i+1).
• If uplo = 'L', Q = H(1)H(2) ... H(n-1)
Each H(i) has the form
H(i) = I - tau*v*v^T for real flavors, or
H(i) = I - tau*v*v^H for complex flavors,
where tau is a real/complex scalar, and v is a real/complex vector with v(1:i) = 0 and v(i+1) = 1.
On exit, tau is stored in tau(i), and v(i+2:n) is stored in AP, overwriting A(i+2:n, i).
The computed matrix T is exactly similar to a matrix A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision. The approximate number of floating-point operations is (4/3)n^3.
After calling this routine, you can call the following:
?opgtr to form the computed matrix Q explicitly
?opmtr to multiply a real matrix by Q.
The complex counterpart of this routine is ?hptrd.
?opgtr
Generates the real orthogonal matrix Q determined by ?sptrd.
Syntax
Fortran 77:
call sopgtr(uplo, n, ap, tau, q, ldq, work, info)
call dopgtr(uplo, n, ap, tau, q, ldq, work, info)
Fortran 95:
call opgtr(ap, tau, q [,uplo] [,info])
C:
lapack_int LAPACKE_<?>opgtr( int matrix_order, char uplo, lapack_int n, const <datatype>* ap, const <datatype>* tau, <datatype>* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine explicitly generates the n-by-n orthogonal matrix Q formed by ?sptrd when reducing a packed real symmetric matrix A to tridiagonal form: A = Q*T*Q^T. Use this routine after a call to ?sptrd.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo    CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?sptrd.
n       INTEGER. The order of the matrix Q (n ≥ 0).
ap, tau REAL for sopgtr, DOUBLE PRECISION for dopgtr. Arrays ap and tau, as returned by ?sptrd. The dimension of ap must be at least max(1, n*(n+1)/2). The dimension of tau must be at least max(1, n-1).
ldq     INTEGER. The leading dimension of the output array q; at least max(1, n).
work    REAL for sopgtr, DOUBLE PRECISION for dopgtr. Workspace array, DIMENSION at least max(1, n-1).
Output Parameters
q       REAL for sopgtr, DOUBLE PRECISION for dopgtr. Array, DIMENSION (ldq,*). Contains the computed matrix Q. The second dimension of q must be at least max(1, n).
info    INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts.
For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine opgtr interface are the following:
ap      Holds the array A of size (n*(n+1)/2).
tau     Holds the vector with the number of elements n - 1.
q       Holds the matrix Q of size (n,n).
uplo    Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The computed matrix Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of floating-point operations is (4/3)*n^3.
The complex counterpart of this routine is upgtr.

?opmtr
Multiplies a real matrix by the real orthogonal matrix Q determined by ?sptrd.
Syntax
Fortran 77:
call sopmtr(side, uplo, trans, m, n, ap, tau, c, ldc, work, info)
call dopmtr(side, uplo, trans, m, n, ap, tau, c, ldc, work, info)
Fortran 95:
call opmtr(ap, tau, c [,side] [,uplo] [,trans] [,info])
C:
lapack_int LAPACKE_<?>opmtr( int matrix_order, char side, char uplo, char trans, lapack_int m, lapack_int n, const <datatype>* ap, const <datatype>* tau, <datatype>* c, lapack_int ldc );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine multiplies a real matrix C by Q or Q^T, where Q is the orthogonal matrix formed by ?sptrd when reducing a packed real symmetric matrix A to tridiagonal form: A = Q*T*Q^T. Use this routine after a call to ?sptrd.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^T*C, C*Q, or C*Q^T (overwriting the result on C).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
In the descriptions below, r denotes the order of Q: if side = 'L', r = m; if side = 'R', r = n.
side    CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^T is applied to C from the left. If side = 'R', Q or Q^T is applied to C from the right.
uplo    CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?sptrd.
trans   CHARACTER*1. Must be either 'N' or 'T'. If trans = 'N', the routine multiplies C by Q. If trans = 'T', the routine multiplies C by Q^T.
m       INTEGER. The number of rows in the matrix C (m ≥ 0).
n       INTEGER. The number of columns in C (n ≥ 0).
ap, tau, c, work  REAL for sopmtr, DOUBLE PRECISION for dopmtr.
        ap and tau are the arrays returned by ?sptrd. The dimension of ap must be at least max(1, r*(r+1)/2). The dimension of tau must be at least max(1, r-1).
        c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
        work(*) is a workspace array. The dimension of work must be at least max(1, n) if side = 'L'; max(1, m) if side = 'R'.
ldc     INTEGER. The leading dimension of c; ldc ≥ max(1, m).
Output Parameters
c       Overwritten by the product Q*C, Q^T*C, C*Q, or C*Q^T (as specified by side and trans).
info    INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine opmtr interface are the following:
ap      Holds the array A of size (r*(r+1)/2), where r = m if side = 'L', and r = n if side = 'R'.
tau     Holds the vector with the number of elements r - 1.
c       Holds the matrix C of size (m,n).
side    Must be 'L' or 'R'. The default value is 'L'.
uplo    Must be 'U' or 'L'. The default value is 'U'.
trans   Must be 'N', 'C', or 'T'. The default value is 'N'.
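Because the order r of Q depends on side, the minimum lengths of ap, tau, and work for ?opmtr are easy to get wrong. The sketch below collects the dimension rules stated in the parameter descriptions into one place. The type opmtr_sizes and the functions max_size and opmtr_min_sizes are hypothetical names used for illustration only; they are not part of Intel MKL, and the code does not call any library routine.

```c
#include <stddef.h>

/* Hypothetical container for the minimum array lengths of ?opmtr. */
typedef struct {
    size_t r;        /* order of Q: m if side = 'L', n if side = 'R' */
    size_t ap_len;   /* at least max(1, r*(r+1)/2) */
    size_t tau_len;  /* at least max(1, r-1) */
    size_t work_len; /* at least max(1, n) for 'L', max(1, m) for 'R' */
} opmtr_sizes;

size_t max_size(size_t a, size_t b)
{
    return a > b ? a : b;
}

/* Fill *out with the minimum lengths for the given side, m, and n.
   Returns 0 on success, or -1 if side is neither 'L' nor 'R'. */
int opmtr_min_sizes(char side, size_t m, size_t n, opmtr_sizes *out)
{
    if (side != 'L' && side != 'R')
        return -1;

    size_t r = (side == 'L') ? m : n;

    out->r = r;
    out->ap_len = max_size(1, r * (r + 1) / 2);
    /* Guard r = 0 so the unsigned subtraction cannot wrap around. */
    out->tau_len = max_size(1, r > 0 ? r - 1 : 0);
    out->work_len = max_size(1, side == 'L' ? n : m);
    return 0;
}
```

For example, with side = 'L', m = 4, and n = 3 the helper reports r = 4, an ap of at least 10 elements, and tau and work of at least 3 elements each.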
Application Notes
The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 2*m^2*n if side = 'L', or 2*n^2*m if side = 'R'.
The complex counterpart of this routine is upmtr.

?hptrd
Reduces a complex Hermitian matrix to tridiagonal form using packed storage.
Syntax
Fortran 77:
call chptrd(uplo, n, ap, d, e, tau, info)
call zhptrd(uplo, n, ap, d, e, tau, info)
Fortran 95:
call hptrd(ap, tau [,uplo] [,info])
C:
lapack_int LAPACKE_chptrd( int matrix_order, char uplo, lapack_int n, lapack_complex_float* ap, float* d, float* e, lapack_complex_float* tau );
lapack_int LAPACKE_zhptrd( int matrix_order, char uplo, lapack_int n, lapack_complex_double* ap, double* d, double* e, lapack_complex_double* tau );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reduces a packed complex Hermitian matrix A to symmetric tridiagonal form T by a unitary similarity transformation: A = Q*T*Q^H. The unitary matrix Q is not formed explicitly but is represented as a product of n-1 elementary reflectors. Routines are provided for working with Q in this representation (see Application Notes below).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo    CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ap stores the packed upper triangle of A. If uplo = 'L', ap stores the packed lower triangle of A.
n       INTEGER. The order of the matrix A (n ≥ 0).
ap      COMPLEX for chptrd, DOUBLE COMPLEX for zhptrd. Array, DIMENSION at least max(1, n*(n+1)/2).
Contains either the upper or the lower triangle of A (as specified by uplo) in the packed form described in "Matrix Arguments" in Appendix B.
Output Parameters
ap      Overwritten by the tridiagonal matrix T and details of the unitary matrix Q, as specified by uplo.
d, e    REAL for chptrd, DOUBLE PRECISION for zhptrd. Arrays:
        d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
        e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
tau     COMPLEX for chptrd, DOUBLE COMPLEX for zhptrd. Array, DIMENSION at least max(1, n-1). Contains further details of the unitary matrix Q.
info    INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hptrd interface are the following:
ap      Holds the array A of size (n*(n+1)/2).
tau     Holds the vector with the number of elements n - 1.
uplo    Must be 'U' or 'L'. The default value is 'U'.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.
Application Notes
The computed matrix T is exactly similar to a matrix A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision.
The approximate number of floating-point operations is (16/3)*n^3.
After calling this routine, you can call the following:
upgtr   to form the computed matrix Q explicitly
upmtr   to multiply a complex matrix by Q.
The real counterpart of this routine is sptrd.

?upgtr
Generates the complex unitary matrix Q determined by ?hptrd.
Syntax
Fortran 77:
call cupgtr(uplo, n, ap, tau, q, ldq, work, info)
call zupgtr(uplo, n, ap, tau, q, ldq, work, info)
Fortran 95:
call upgtr(ap, tau, q [,uplo] [,info])
C:
lapack_int LAPACKE_<?>upgtr( int matrix_order, char uplo, lapack_int n, const <datatype>* ap, const <datatype>* tau, <datatype>* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine explicitly generates the n-by-n unitary matrix Q formed by ?hptrd when reducing a packed complex Hermitian matrix A to tridiagonal form: A = Q*T*Q^H. Use this routine after a call to ?hptrd.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
uplo    CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?hptrd.
n       INTEGER. The order of the matrix Q (n ≥ 0).
ap, tau COMPLEX for cupgtr, DOUBLE COMPLEX for zupgtr. Arrays ap and tau, as returned by ?hptrd. The dimension of ap must be at least max(1, n*(n+1)/2). The dimension of tau must be at least max(1, n-1).
ldq     INTEGER. The leading dimension of the output array q; at least max(1, n).
work    COMPLEX for cupgtr, DOUBLE COMPLEX for zupgtr. Workspace array, DIMENSION at least max(1, n-1).
Output Parameters
q       COMPLEX for cupgtr, DOUBLE COMPLEX for zupgtr. Array, DIMENSION (ldq,*). Contains the computed matrix Q. The second dimension of q must be at least max(1, n).
info    INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine upgtr interface are the following:
ap      Holds the array A of size (n*(n+1)/2).
tau     Holds the vector with the number of elements n - 1.
q       Holds the matrix Q of size (n,n).
uplo    Must be 'U' or 'L'. The default value is 'U'.
Application Notes
The computed matrix Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε), where ε is the machine precision.
The approximate number of floating-point operations is (16/3)*n^3.
The real counterpart of this routine is opgtr.

?upmtr
Multiplies a complex matrix by the unitary matrix Q determined by ?hptrd.
Syntax
Fortran 77:
call cupmtr(side, uplo, trans, m, n, ap, tau, c, ldc, work, info)
call zupmtr(side, uplo, trans, m, n, ap, tau, c, ldc, work, info)
Fortran 95:
call upmtr(ap, tau, c [,side] [,uplo] [,trans] [,info])
C:
lapack_int LAPACKE_<?>upmtr( int matrix_order, char side, char uplo, char trans, lapack_int m, lapack_int n, const <datatype>* ap, const <datatype>* tau, <datatype>* c, lapack_int ldc );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine multiplies a complex matrix C by Q or Q^H, where Q is the unitary matrix formed by ?hptrd when reducing a packed complex Hermitian matrix A to tridiagonal form: A = Q*T*Q^H. Use this routine after a call to ?hptrd.
Depending on the parameters side and trans, the routine can form one of the matrix products Q*C, Q^H*C, C*Q, or C*Q^H (overwriting the result on C).
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
In the descriptions below, r denotes the order of Q: if side = 'L', r = m; if side = 'R', r = n.
side    CHARACTER*1. Must be either 'L' or 'R'. If side = 'L', Q or Q^H is applied to C from the left.
If side = 'R', Q or Q^H is applied to C from the right.
uplo    CHARACTER*1. Must be 'U' or 'L'. Use the same uplo as supplied to ?hptrd.
trans   CHARACTER*1. Must be either 'N' or 'C'. If trans = 'N', the routine multiplies C by Q. If trans = 'C', the routine multiplies C by Q^H.
m       INTEGER. The number of rows in the matrix C (m ≥ 0).
n       INTEGER. The number of columns in C (n ≥ 0).
ap, tau, c, work  COMPLEX for cupmtr, DOUBLE COMPLEX for zupmtr.
        ap and tau are the arrays returned by ?hptrd. The dimension of ap must be at least max(1, r*(r+1)/2). The dimension of tau must be at least max(1, r-1).
        c(ldc,*) contains the matrix C. The second dimension of c must be at least max(1, n).
        work(*) is a workspace array. The dimension of work must be at least max(1, n) if side = 'L'; max(1, m) if side = 'R'.
ldc     INTEGER. The leading dimension of c; ldc ≥ max(1, m).
Output Parameters
c       Overwritten by the product Q*C, Q^H*C, C*Q, or C*Q^H (as specified by side and trans).
info    INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine upmtr interface are the following:
ap      Holds the array A of size (r*(r+1)/2), where r = m if side = 'L', and r = n if side = 'R'.
tau     Holds the vector with the number of elements r - 1.
c       Holds the matrix C of size (m,n).
side    Must be 'L' or 'R'. The default value is 'L'.
uplo    Must be 'U' or 'L'. The default value is 'U'.
trans   Must be 'N' or 'C'. The default value is 'N'.
Application Notes
The computed product differs from the exact product by a matrix E such that ||E||_2 = O(ε)*||C||_2, where ε is the machine precision.
The total number of floating-point operations is approximately 8*m^2*n if side = 'L' or 8*n^2*m if side = 'R'.
The real counterpart of this routine is opmtr.

?sbtrd
Reduces a real symmetric band matrix to tridiagonal form.
Syntax
Fortran 77:
call ssbtrd(vect, uplo, n, kd, ab, ldab, d, e, q, ldq, work, info)
call dsbtrd(vect, uplo, n, kd, ab, ldab, d, e, q, ldq, work, info)
Fortran 95:
call sbtrd(ab [,q] [,vect] [,uplo] [,info])
C:
lapack_int LAPACKE_<?>sbtrd( int matrix_order, char vect, char uplo, lapack_int n, lapack_int kd, <datatype>* ab, lapack_int ldab, <datatype>* d, <datatype>* e, <datatype>* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reduces a real symmetric band matrix A to symmetric tridiagonal form T by an orthogonal similarity transformation: A = Q*T*Q^T. The orthogonal matrix Q is determined as a product of Givens rotations.
If required, the routine can also form the matrix Q explicitly.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
vect    CHARACTER*1. Must be 'V' or 'N'. If vect = 'V', the routine returns the explicit matrix Q. If vect = 'N', the routine does not return Q.
uplo    CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ab stores the upper triangular part of A. If uplo = 'L', ab stores the lower triangular part of A.
n       INTEGER. The order of the matrix A (n ≥ 0).
kd      INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, q, work  REAL for ssbtrd, DOUBLE PRECISION for dsbtrd.
        ab(ldab,*) is an array containing either the upper or the lower triangular part of the matrix A (as specified by uplo) in band storage format. The second dimension of ab must be at least max(1, n).
        q(ldq,*) is an array.
        If vect = 'U', the q array must contain an n-by-n matrix X. If vect = 'N' or 'V', the q parameter need not be set. The second dimension of q must be at least max(1, n).
        work(*) is a workspace array. The dimension of work must be at least max(1, n).
ldab    INTEGER. The leading dimension of ab; at least kd+1.
ldq     INTEGER. The leading dimension of q. Constraints: ldq ≥ max(1, n) if vect = 'V'; ldq ≥ 1 if vect = 'N'.
Output Parameters
ab      On exit, the diagonal elements of the array ab are overwritten by the diagonal elements of the tridiagonal matrix T. If kd > 0, the elements on the first superdiagonal (if uplo = 'U') or the first subdiagonal (if uplo = 'L') are overwritten by the off-diagonal elements of T. The rest of ab is overwritten by values generated during the reduction.
d, e, q REAL for ssbtrd, DOUBLE PRECISION for dsbtrd. Arrays:
        d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
        e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
        q(ldq,*) is not referenced if vect = 'N'. If vect = 'V', q contains the n-by-n matrix Q. The second dimension of q must be: at least max(1, n) if vect = 'V'; at least 1 if vect = 'N'.
info    INTEGER. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sbtrd interface are the following:
ab      Holds the array A of size (kd+1,n).
q       Holds the matrix Q of size (n,n).
uplo    Must be 'U' or 'L'. The default value is 'U'.
vect    If omitted, this argument is restored based on the presence of argument q as follows: vect = 'V', if q is present, vect = 'N', if q is omitted.
        If present, vect must be equal to 'V' or 'U' and the argument q must also be present. Note that there will be an error condition if vect is present and q omitted.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.
Application Notes
The computed matrix T is exactly similar to a matrix A+E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision. The computed matrix Q differs from an exactly orthogonal matrix by a matrix E such that ||E||_2 = O(ε).
The total number of floating-point operations is approximately 6*n^2*kd if vect = 'N', with 3*n^3*(kd-1)/kd additional operations if vect = 'V'.
The complex counterpart of this routine is hbtrd.

?hbtrd
Reduces a complex Hermitian band matrix to tridiagonal form.
Syntax
Fortran 77:
call chbtrd(vect, uplo, n, kd, ab, ldab, d, e, q, ldq, work, info)
call zhbtrd(vect, uplo, n, kd, ab, ldab, d, e, q, ldq, work, info)
Fortran 95:
call hbtrd(ab [,q] [,vect] [,uplo] [,info])
C:
lapack_int LAPACKE_chbtrd( int matrix_order, char vect, char uplo, lapack_int n, lapack_int kd, lapack_complex_float* ab, lapack_int ldab, float* d, float* e, lapack_complex_float* q, lapack_int ldq );
lapack_int LAPACKE_zhbtrd( int matrix_order, char vect, char uplo, lapack_int n, lapack_int kd, lapack_complex_double* ab, lapack_int ldab, double* d, double* e, lapack_complex_double* q, lapack_int ldq );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine reduces a complex Hermitian band matrix A to symmetric tridiagonal form T by a unitary similarity transformation: A = Q*T*Q^H. The unitary matrix Q is determined as a product of Givens rotations.
If required, the routine can also form the matrix Q explicitly.
Input Parameters
The data types are given for the Fortran interface.
A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
vect    CHARACTER*1. Must be 'V' or 'N'. If vect = 'V', the routine returns the explicit matrix Q. If vect = 'N', the routine does not return Q.
uplo    CHARACTER*1. Must be 'U' or 'L'. If uplo = 'U', ab stores the upper triangular part of A. If uplo = 'L', ab stores the lower triangular part of A.
n       INTEGER. The order of the matrix A (n ≥ 0).
kd      INTEGER. The number of super- or sub-diagonals in A (kd ≥ 0).
ab, work  COMPLEX for chbtrd, DOUBLE COMPLEX for zhbtrd.
        ab(ldab,*) is an array containing either the upper or the lower triangular part of the matrix A (as specified by uplo) in band storage format. The second dimension of ab must be at least max(1, n).
        work(*) is a workspace array. The dimension of work must be at least max(1, n).
ldab    INTEGER. The leading dimension of ab; at least kd+1.
ldq     INTEGER. The leading dimension of q. Constraints: ldq ≥ max(1, n) if vect = 'V'; ldq ≥ 1 if vect = 'N'.
Output Parameters
ab      On exit, the diagonal elements of the array ab are overwritten by the diagonal elements of the tridiagonal matrix T. If kd > 0, the elements on the first superdiagonal (if uplo = 'U') or the first subdiagonal (if uplo = 'L') are overwritten by the off-diagonal elements of T. The rest of ab is overwritten by values generated during the reduction.
d, e    REAL for chbtrd, DOUBLE PRECISION for zhbtrd. Arrays:
        d(*) contains the diagonal elements of the matrix T. The dimension of d must be at least max(1, n).
        e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
q       COMPLEX for chbtrd, DOUBLE COMPLEX for zhbtrd. Array, DIMENSION (ldq,*). If vect = 'N', q is not referenced. If vect = 'V', q contains the n-by-n matrix Q. The second dimension of q must be: at least max(1, n) if vect = 'V'; at least 1 if vect = 'N'.
info    INTEGER.
If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine hbtrd interface are the following:
ab      Holds the array A of size (kd+1,n).
q       Holds the matrix Q of size (n,n).
uplo    Must be 'U' or 'L'. The default value is 'U'.
vect    If omitted, this argument is restored based on the presence of argument q as follows: vect = 'V', if q is present, vect = 'N', if q is omitted. If present, vect must be equal to 'V' or 'U' and the argument q must also be present. Note that there will be an error condition if vect is present and q omitted.
Note that diagonal (d) and off-diagonal (e) elements of the matrix T are omitted because they are kept in the matrix A on exit.
Application Notes
The computed matrix T is exactly similar to a matrix A + E, where ||E||_2 = c(n)*ε*||A||_2, c(n) is a modestly increasing function of n, and ε is the machine precision. The computed matrix Q differs from an exactly unitary matrix by a matrix E such that ||E||_2 = O(ε).
The total number of floating-point operations is approximately 20*n^2*kd if vect = 'N', with 10*n^3*(kd-1)/kd additional operations if vect = 'V'.
The real counterpart of this routine is sbtrd.

?sterf
Computes all eigenvalues of a real symmetric tridiagonal matrix using the QR algorithm.
Syntax
Fortran 77:
call ssterf(n, d, e, info)
call dsterf(n, d, e, info)
Fortran 95:
call sterf(d, e [,info])
C:
lapack_int LAPACKE_<?>sterf( lapack_int n, <datatype>* d, <datatype>* e );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes all the eigenvalues of a real symmetric tridiagonal matrix T (which can be obtained by reducing a symmetric or Hermitian matrix to tridiagonal form). The routine uses a square-root-free variant of the QR algorithm.
If you need not only the eigenvalues but also the eigenvectors, call steqr.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
n       INTEGER. The order of the matrix T (n ≥ 0).
d, e    REAL for ssterf, DOUBLE PRECISION for dsterf. Arrays:
        d(*) contains the diagonal elements of T. The dimension of d must be at least max(1, n).
        e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
Output Parameters
d       The n eigenvalues in ascending order, unless info > 0. See also info.
e       On exit, the array is overwritten; see info.
info    INTEGER. If info = 0, the execution is successful. If info = i, the algorithm failed to find all the eigenvalues after 30*n iterations: i off-diagonal elements have not converged to zero. On exit, d and e contain, respectively, the diagonal and off-diagonal elements of a tridiagonal matrix orthogonally similar to T. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine sterf interface are the following:
d       Holds the vector of length n.
e       Holds the vector of length (n-1).
Application Notes
The computed eigenvalues and eigenvectors are exact for a matrix T+E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.
If λi is an exact eigenvalue, and μi is the corresponding computed value, then |μi - λi| ≤ c(n)*ε*||T||_2, where c(n) is a modestly increasing function of n.
The total number of floating-point operations depends on how rapidly the algorithm converges. Typically, it is about 14*n^2.

?steqr
Computes all eigenvalues and eigenvectors of a symmetric or Hermitian matrix reduced to tridiagonal form (QR algorithm).
Syntax
Fortran 77:
call ssteqr(compz, n, d, e, z, ldz, work, info)
call dsteqr(compz, n, d, e, z, ldz, work, info)
call csteqr(compz, n, d, e, z, ldz, work, info)
call zsteqr(compz, n, d, e, z, ldz, work, info)
Fortran 95:
call rsteqr(d, e [,z] [,compz] [,info])
call steqr(d, e [,z] [,compz] [,info])
C:
lapack_int LAPACKE_ssteqr( int matrix_order, char compz, lapack_int n, float* d, float* e, float* z, lapack_int ldz );
lapack_int LAPACKE_dsteqr( int matrix_order, char compz, lapack_int n, double* d, double* e, double* z, lapack_int ldz );
lapack_int LAPACKE_csteqr( int matrix_order, char compz, lapack_int n, float* d, float* e, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zsteqr( int matrix_order, char compz, lapack_int n, double* d, double* e, lapack_complex_double* z, lapack_int ldz );
Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h
Description
The routine computes all the eigenvalues and (optionally) all the eigenvectors of a real symmetric tridiagonal matrix T. In other words, the routine can compute the spectral factorization: T = Z*Λ*Z^T. Here Λ
is a diagonal matrix whose diagonal elements are the eigenvalues λi; Z is an orthogonal matrix whose columns are eigenvectors. Thus, T*zi = λi*zi for i = 1, 2, ..., n.
The routine normalizes the eigenvectors so that ||zi||_2 = 1.
You can also use the routine for computing the eigenvalues and eigenvectors of an arbitrary real symmetric (or complex Hermitian) matrix A reduced to tridiagonal form T: A = Q*T*Q^H. In this case, the spectral factorization is as follows: A = Q*T*Q^H = (Q*Z)*Λ*(Q*Z)^H. Before calling ?steqr, you must reduce A to tridiagonal form and generate the explicit matrix Q by calling the following routines:
                  for real matrices:    for complex matrices:
full storage      ?sytrd, ?orgtr        ?hetrd, ?ungtr
packed storage    ?sptrd, ?opgtr        ?hptrd, ?upgtr
band storage      ?sbtrd (vect='V')     ?hbtrd (vect='V')
If you need eigenvalues only, it is more efficient to call sterf. If T is positive-definite, pteqr can compute small eigenvalues more accurately than ?steqr.
To solve the problem by a single call, use one of the divide and conquer routines stevd, syevd, spevd, or sbevd for real symmetric matrices or heevd, hpevd, or hbevd for complex Hermitian matrices.
Input Parameters
The data types are given for the Fortran interface. A <datatype> placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.
compz   CHARACTER*1. Must be 'N' or 'I' or 'V'. If compz = 'N', the routine computes eigenvalues only. If compz = 'I', the routine computes the eigenvalues and eigenvectors of the tridiagonal matrix T. If compz = 'V', the routine computes the eigenvalues and eigenvectors of A (and the array z must contain the matrix Q on entry).
n       INTEGER. The order of the matrix T (n ≥ 0).
d, e, work  REAL for single-precision flavors, DOUBLE PRECISION for double-precision flavors. Arrays:
        d(*) contains the diagonal elements of T. The dimension of d must be at least max(1, n).
        e(*) contains the off-diagonal elements of T. The dimension of e must be at least max(1, n-1).
        work(*) is a workspace array. The dimension of work must be: at least 1 if compz = 'N'; at least max(1, 2*n-2) if compz = 'V' or 'I'.
z       REAL for ssteqr, DOUBLE PRECISION for dsteqr, COMPLEX for csteqr, DOUBLE COMPLEX for zsteqr. Array, DIMENSION (ldz,*). If compz = 'N' or 'I', z need not be set. If compz = 'V', z must contain the n-by-n matrix Q. The second dimension of z must be: at least 1 if compz = 'N'; at least max(1, n) if compz = 'V' or 'I'.
ldz     INTEGER. The leading dimension of z. Constraints: ldz ≥ 1 if compz = 'N'; ldz ≥ max(1, n) if compz = 'V' or 'I'.
Output Parameters
d       The n eigenvalues in ascending order, unless info > 0. See also info.
e       On exit, the array is overwritten; see info.
z       If info = 0, contains the n orthonormal eigenvectors, stored by columns. (The i-th column corresponds to the i-th eigenvalue.)
info    INTEGER. If info = 0, the execution is successful. If info = i, the algorithm failed to find all the eigenvalues after 30*n iterations: i off-diagonal elements have not converged to zero. On exit, d and e contain, respectively, the diagonal and off-diagonal elements of a tridiagonal matrix orthogonally similar to T. If info = -i, the i-th parameter had an illegal value.
Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine steqr interface are the following:
d       Holds the vector of length n.
e       Holds the vector of length (n-1).
z Holds the matrix Z of size (n,n).
compz If omitted, this argument is restored based on the presence of argument z as follows: compz = 'I', if z is present, compz = 'N', if z is omitted. If present, compz must be equal to 'I' or 'V' and the argument z must also be present. Note that there will be an error condition if compz is present and z omitted.
Note that two variants of Fortran 95 interface for steqr routine are needed because an ambiguous choice between real and complex cases appears when z is omitted. Thus, the name rsteqr is used in real cases (single or double precision), and the name steqr is used in complex cases (single or double precision).

Application Notes
The computed eigenvalues and eigenvectors are exact for a matrix T+E such that ||E||_2 = O(ε)*||T||_2, where ε is the machine precision.
If λ_i is an exact eigenvalue, and µ_i is the corresponding computed value, then
|µ_i - λ_i| ≤ c(n)*ε*||T||_2
where c(n) is a modestly increasing function of n.
If z_i is the corresponding exact eigenvector, and w_i is the corresponding computed vector, then the angle θ(z_i, w_i) between them is bounded as follows:
θ(z_i, w_i) ≤ c(n)*ε*||T||_2 / min_{i≠j}|λ_i - λ_j|.
The total number of floating-point operations depends on how rapidly the algorithm converges. Typically, it is about
24n^2 if compz = 'N';
7n^3 (for complex flavors, 14n^3) if compz = 'V' or 'I'.

?stemr
Computes selected eigenvalues and eigenvectors of a real symmetric tridiagonal matrix.
Syntax
Fortran 77:
call sstemr(jobz, range, n, d, e, vl, vu, il, iu, m, w, z, ldz, nzc, isuppz, tryrac, work, lwork, iwork, liwork, info)
call dstemr(jobz, range, n, d, e, vl, vu, il, iu, m, w, z, ldz, nzc, isuppz, tryrac, work, lwork, iwork, liwork, info)
call cstemr(jobz, range, n, d, e, vl, vu, il, iu, m, w, z, ldz, nzc, isuppz, tryrac, work, lwork, iwork, liwork, info)
call zstemr(jobz, range, n, d, e, vl, vu, il, iu, m, w, z, ldz, nzc, isuppz, tryrac, work, lwork, iwork, liwork, info)
C:
lapack_int LAPACKE_sstemr( int matrix_order, char jobz, char range, lapack_int n, const float* d, float* e, float vl, float vu, lapack_int il, lapack_int iu, lapack_int* m, float* w, float* z, lapack_int ldz, lapack_int nzc, lapack_int* isuppz, lapack_logical* tryrac );
lapack_int LAPACKE_dstemr( int matrix_order, char jobz, char range, lapack_int n, const double* d, double* e, double vl, double vu, lapack_int il, lapack_int iu, lapack_int* m, double* w, double* z, lapack_int ldz, lapack_int nzc, lapack_int* isuppz, lapack_logical* tryrac );
lapack_int LAPACKE_cstemr( int matrix_order, char jobz, char range, lapack_int n, const float* d, float* e, float vl, float vu, lapack_int il, lapack_int iu, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int nzc, lapack_int* isuppz, lapack_logical* tryrac );
lapack_int LAPACKE_zstemr( int matrix_order, char jobz, char range, lapack_int n, const double* d, double* e, double vl, double vu, lapack_int il, lapack_int iu, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int nzc, lapack_int* isuppz, lapack_logical* tryrac );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix T. Any such unreduced matrix has a well-defined set of pairwise different real eigenvalues; the corresponding real eigenvectors are pairwise orthogonal.
The spectrum may be computed either completely or partially by specifying either an interval (vl,vu] or a range of indices il:iu for the desired eigenvalues.
Depending on the number of desired eigenvalues, these are computed either by bisection or the dqds algorithm. Numerically orthogonal eigenvectors are computed by the use of various suitable L*D*L^T factorizations near clusters of close eigenvalues (referred to as RRRs, Relatively Robust Representations). An informal sketch of the algorithm follows.
For each unreduced block (submatrix) of T,
a. Compute T - sigma*I = L*D*L^T, so that L and D define all the wanted eigenvalues to high relative accuracy. This means that small relative changes in the entries of L and D cause only small relative changes in the eigenvalues and eigenvectors. The standard (unfactored) representation of the tridiagonal matrix T does not have this property in general.
b. Compute the eigenvalues to suitable accuracy. If the eigenvectors are desired, the algorithm attains full accuracy of the computed eigenvalues only right before the corresponding vectors have to be computed, see steps c and d.
c. For each cluster of close eigenvalues, select a new shift close to the cluster, find a new factorization, and refine the shifted eigenvalues to suitable accuracy.
d. For each eigenvalue with a large enough relative separation, compute the corresponding eigenvector by forming a rank revealing twisted factorization. Go back to step c for any clusters that remain.
For more details, see: [Dhillon04], [Dhillon04-02], [Dhillon97]
The routine works only on machines which follow IEEE-754 floating-point standard in their handling of infinities and NaNs (NaN stands for "not a number"). This permits the use of efficient inner loops avoiding a check for zero divisors.
LAPACK routines can be used to reduce a complex Hermitian matrix to real symmetric tridiagonal form.
(Any complex Hermitian tridiagonal matrix has real values on its diagonal and potentially complex numbers on its off-diagonals. By applying a similarity transform with an appropriate diagonal matrix diag(1, e^{i φ_1}, ..., e^{i φ_{n-1}}), the complex Hermitian matrix can be transformed into a real symmetric matrix and complex arithmetic can be entirely avoided.) While the eigenvectors of the real symmetric tridiagonal matrix are real, the eigenvectors of the original complex Hermitian matrix have complex entries in general. Since LAPACK drivers overwrite the matrix data with the eigenvectors, zstemr accepts complex workspace to facilitate interoperability with zunmtr or zupmtr.

Input Parameters
The data types are given for the Fortran interface. A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

jobz CHARACTER*1. Must be 'N' or 'V'.
If jobz = 'N', then only eigenvalues are computed.
If jobz = 'V', then eigenvalues and eigenvectors are computed.

range CHARACTER*1. Must be 'A' or 'V' or 'I'.
If range = 'A', the routine computes all eigenvalues.
If range = 'V', the routine computes all eigenvalues in the half-open interval: (vl, vu].
If range = 'I', the routine computes eigenvalues with indices il to iu.

n INTEGER. The order of the matrix T (n ≥ 0).

d REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n).
Contains n diagonal elements of the tridiagonal matrix T.

e REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n).
Contains (n-1) off-diagonal elements of the tridiagonal matrix T in elements 1 to n-1 of e. e(n) need not be set on input, but is used internally as workspace.

vl, vu REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
If range = 'V', the lower and upper bounds of the interval to be searched for eigenvalues. Constraint: vl < vu.
If range = 'A' or 'I', vl and vu are not referenced.

il, iu INTEGER.
If range = 'I', the indices in ascending order of the smallest and largest eigenvalues to be returned. Constraint: 1 ≤ il ≤ iu ≤ n, if n > 0.
If range = 'A' or 'V', il and iu are not referenced.

ldz INTEGER. The leading dimension of the output array z.
If jobz = 'V', then ldz ≥ max(1, n); ldz ≥ 1 otherwise.

nzc INTEGER. The number of eigenvectors to be held in the array z.
If range = 'A', then nzc ≥ max(1, n);
If range = 'V', then nzc is greater than or equal to the number of eigenvalues in the half-open interval: (vl, vu].
If range = 'I', then nzc ≥ iu-il+1.
If nzc = -1, then a workspace query is assumed; the routine calculates the number of columns of the array z that are needed to hold the eigenvectors. This value is returned as the first entry of the array z, and no error message related to nzc is issued by the routine xerbla.

tryrac LOGICAL.
If tryrac = .TRUE., it indicates that the code should check whether the tridiagonal matrix defines its eigenvalues to high relative accuracy. If so, the code uses relative-accuracy preserving algorithms that might be (a bit) slower depending on the matrix. If the matrix does not define its eigenvalues to high relative accuracy, the code can use possibly faster algorithms.
If tryrac = .FALSE., the code is not required to guarantee relatively accurate eigenvalues and can use the fastest possible techniques.

work REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Workspace array, DIMENSION (lwork).

lwork INTEGER.
The dimension of the array work, lwork ≥ max(1, 18*n).
If lwork=-1, then a workspace query is assumed; the routine only calculates the optimal size of the work array, returns this value as the first entry of the work array, and no error message related to lwork is issued by xerbla.

iwork INTEGER.
Workspace array, DIMENSION (liwork).

liwork INTEGER.
The dimension of the array iwork.
liwork ≥ max(1, 10*n) if the eigenvectors are desired, and liwork ≥ max(1, 8*n) if only the eigenvalues are to be computed.
If liwork=-1, then a workspace query is assumed; the routine only calculates the optimal size of the iwork array, returns this value as the first entry of the iwork array, and no error message related to liwork is issued by xerbla.

Output Parameters
e On exit, the array e is overwritten.

m INTEGER.
The total number of eigenvalues found, 0 ≤ m ≤ n.
If range = 'A', then m=n, and if range = 'I', then m=iu-il+1.

w REAL for single precision flavors
DOUBLE PRECISION for double precision flavors.
Array, DIMENSION (n).
The first m elements contain the selected eigenvalues in ascending order.

z REAL for sstemr
DOUBLE PRECISION for dstemr
COMPLEX for cstemr
DOUBLE COMPLEX for zstemr.
Array z(ldz, *), the second dimension of z must be at least max(1, m).
If jobz = 'V', and info = 0, then the first m columns of z contain the orthonormal eigenvectors of the matrix T corresponding to the selected eigenvalues, with the i-th column of z holding the eigenvector associated with w(i).
If jobz = 'N', then z is not referenced.
Note: you must ensure that at least max(1,m) columns are supplied in the array z; if range = 'V', the exact value of m is not known in advance and can be computed with a workspace query by setting nzc=-1, see description of the parameter nzc.

isuppz INTEGER.
Array, DIMENSION (2*max(1, m)).
The support of the eigenvectors in z, that is, the indices indicating the nonzero elements in z. The i-th computed eigenvector is nonzero only in elements isuppz(2*i-1) through isuppz(2*i). This is relevant in the case when the matrix is split. isuppz is only accessed when jobz = 'V' and n > 0.

tryrac On exit, a .TRUE. tryrac is set to .FALSE. if the matrix does not define its eigenvalues to high relative accuracy.

work(1) On exit, if info = 0, then work(1) returns the optimal (and minimal) size of lwork.

iwork(1) On exit, if info = 0, then iwork(1) returns the optimal size of liwork.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = 1, an internal error in ?larre occurred.
If info = 2, an internal error in ?larrv occurred.

?stedc
Computes all eigenvalues and eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method.

Syntax
Fortran 77:
call sstedc(compz, n, d, e, z, ldz, work, lwork, iwork, liwork, info)
call dstedc(compz, n, d, e, z, ldz, work, lwork, iwork, liwork, info)
call cstedc(compz, n, d, e, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
call zstedc(compz, n, d, e, z, ldz, work, lwork, rwork, lrwork, iwork, liwork, info)
Fortran 95:
call rstedc(d, e [,z] [,compz] [,info])
call stedc(d, e [,z] [,compz] [,info])
C:
lapack_int LAPACKE_sstedc( int matrix_order, char compz, lapack_int n, float* d, float* e, float* z, lapack_int ldz );
lapack_int LAPACKE_dstedc( int matrix_order, char compz, lapack_int n, double* d, double* e, double* z, lapack_int ldz );
lapack_int LAPACKE_cstedc( int matrix_order, char compz, lapack_int n, float* d, float* e, lapack_complex_float* z, lapack_int ldz );
lapack_int LAPACKE_zstedc( int matrix_order, char compz, lapack_int n, double* d, double* e, lapack_complex_double* z, lapack_int ldz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes all the eigenvalues and (optionally) all the eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method. The eigenvectors of a full or band real symmetric or complex Hermitian matrix can also be found if sytrd/hetrd or sptrd/hptrd or sbtrd/hbtrd has been used to reduce this matrix to tridiagonal form.
See also laed0, laed1, laed2, laed3, laed4, laed5, laed6, laed7, laed8, laed9, and laeda used by this function.

Input Parameters
The data types are given for the Fortran interface.
A placeholder, if present, is used for the C interface data types in the C interface section above. See the C Interface Conventions section for the C interface principal conventions and type definitions.

compz CHARACTER*1. Must be 'N' or 'I' or 'V'.
If compz = 'N', the routine computes eigenvalues only.
If compz = 'I', the routine computes the eigenvalues and eigenvectors of the tridiagonal matrix.
If compz = 'V', the routine computes the eigenvalues and eigenvectors of the original symmetric/Hermitian matrix. On entry, the array z must contain the orthogonal/unitary matrix used to reduce the original matrix to tridiagonal form.

n INTEGER. The order of the symmetric tridiagonal matrix (n ≥ 0).

d, e, rwork REAL for single-precision flavors
DOUBLE PRECISION for double-precision flavors.
Arrays:
d(*) contains the diagonal elements of the tridiagonal matrix. The dimension of d must be at least max(1, n).
e(*) contains the subdiagonal elements of the tridiagonal matrix. The dimension of e must be at least max(1, n-1).
rwork is a workspace array, its dimension max(1, lrwork).

z, work REAL for sstedc
DOUBLE PRECISION for dstedc
COMPLEX for cstedc
DOUBLE COMPLEX for zstedc.
Arrays: z(ldz, *), work(*).
If compz = 'V', then, on entry, z must contain the orthogonal/unitary matrix used to reduce the original matrix to tridiagonal form.
The second dimension of z must be at least max(1, n).
work is a workspace array, its dimension max(1, lwork).

ldz INTEGER. The leading dimension of z. Constraints:
ldz ≥ 1 if compz = 'N';
ldz ≥ max(1, n) if compz = 'V' or 'I'.

lwork INTEGER. The dimension of the array work.
For real functions sstedc and dstedc:
• If compz = 'N' or n = 1, lwork must be at least 1.
• If compz = 'V' and n > 1, lwork must be at least 1 + 3*n + 2*n*log2(n) + 4*n^2, where log2(n) is the smallest integer k such that 2^k ≥ n.
• If compz = 'I' and n > 1, lwork must be at least 1 + 4*n + n^2.
Note that for compz = 'I' or 'V' and if n is less than or equal to the minimum divide size, usually 25, then lwork need only be max(1, 2*(n-1)).
For complex functions cstedc and zstedc:
• If compz = 'N' or 'I', or n = 1, lwork must be at least 1.
• If compz = 'V' and n > 1, lwork must be at least n^2.
Note that for compz = 'V', and if n is less than or equal to the minimum divide size, usually 25, then lwork need only be 1.
If lwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for the required value of lwork.

lrwork INTEGER. The dimension of the array rwork (used for complex flavors only).
If compz = 'N', or n = 1, lrwork must be at least 1.
If compz = 'V' and n > 1, lrwork must be at least (1+3*n+2*n*lg(n)+4*n*n), where lg(n) is the smallest integer k such that 2**k ≥ n.
If compz = 'I' and n > 1, lrwork must be at least (1+4*n+2*n*n).
Note that for compz = 'V' or 'I', and if n is less than or equal to the minimum divide size, usually 25, then lrwork need only be max(1, 2*(n-1)).
If lrwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for the required value of lrwork.

iwork INTEGER. Workspace array, its dimension max(1, liwork).

liwork INTEGER. The dimension of the array iwork.
If compz = 'N', or n = 1, liwork must be at least 1.
If compz = 'V' and n > 1, liwork must be at least (6+6*n+5*n*lg(n)), where lg(n) is the smallest integer k such that 2**k ≥ n.
If compz = 'I' and n > 1, liwork must be at least (3+5*n).
Note that for compz = 'V' or 'I', and if n is less than or equal to the minimum divide size, usually 25, then liwork need only be 1.
If liwork = -1, then a workspace query is assumed; the routine only calculates the optimal size of the work, rwork and iwork arrays, returns these values as the first entries of the work, rwork and iwork arrays, and no error message related to lwork or lrwork or liwork is issued by xerbla. See Application Notes for the required value of liwork.

Output Parameters
d The n eigenvalues in ascending order, unless info ≠ 0. See also info.

e On exit, the array is overwritten; see info.

z If info = 0, then if compz = 'V', z contains the orthonormal eigenvectors of the original symmetric/Hermitian matrix, and if compz = 'I', z contains the orthonormal eigenvectors of the symmetric tridiagonal matrix. If compz = 'N', z is not referenced.

work(1) On exit, if info = 0, then work(1) returns the optimal lwork.

rwork(1) On exit, if info = 0, then rwork(1) returns the optimal lrwork (for complex flavors only).

iwork(1) On exit, if info = 0, then iwork(1) returns the optimal liwork.

info INTEGER.
If info = 0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, the algorithm failed to compute an eigenvalue while working on the submatrix lying in rows and columns i/(n+1) through mod(i, n+1).

Fortran 95 Interface Notes
Routines in Fortran 95 interface have fewer arguments in the calling sequence than their FORTRAN 77 counterparts. For general conventions applied to skip redundant or restorable arguments, see Fortran 95 Interface Conventions.
Specific details for the routine stedc interface are the following:
d Holds the vector of length n.
e Holds the vector of length (n-1).
z Holds the matrix Z of size (n,n).
compz If omitted, this argument is restored based on the presence of argument z as follows: compz = 'I', if z is present, compz = 'N', if z is omitted. If present, compz must be equal to 'I' or 'V' and the argument z must also be present. Note that there will be an error condition if compz is present and z omitted.
Note that two variants of Fortran 95 interface for stedc routine are needed because an ambiguous choice between real and complex cases appears when z and work are omitted. Thus, the name rstedc is used in real cases (single or double precision), and the name stedc is used in complex cases (single or double precision).

Application Notes
The required size of workspace arrays must be as follows.
For sstedc/dstedc:
If compz = 'N' or n = 1 then lwork must be at least 1.
If compz = 'V' and n > 1 then lwork must be at least (1 + 3n + 2n·lg(n) + 4n^2), where lg(n) = smallest integer k such that 2^k ≥ n.
If compz = 'I' and n > 1 then lwork must be at least (1 + 4n + n^2).
If compz = 'N' or n = 1 then liwork must be at least 1.
If compz = 'V' and n > 1 then liwork must be at least (6 + 6n + 5n·lg(n)).
If compz = 'I' and n > 1 then liwork must be at least (3 + 5n).
For cstedc/zstedc:
If compz = 'N' or 'I', or n = 1, lwork must be at least 1.
If compz = 'V' and n > 1, lwork must be at least n^2.
If compz = 'N' or n = 1, lrwork must be at least 1.
If compz = 'V' and n > 1, lrwork must be at least (1 + 3n + 2n·lg(n) + 4n^2), where lg(n) = smallest integer k such that 2^k ≥ n.
If compz = 'I' and n > 1, lrwork must be at least (1 + 4n + 2n^2).
The required value of liwork for complex flavors is the same as for real flavors.
If lwork (or liwork or lrwork, if supplied) is equal to -1, then the routine returns immediately and provides the recommended workspace in the first element of the corresponding array (work, iwork, rwork). This operation is called a workspace query.
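The workspace rules above can be collected into one helper. The following Python sketch is only an illustration of the arithmetic, not part of Intel MKL: the names lg and stedc_min_workspace are hypothetical, the 4*n^2 term is the one given in the lwork and lrwork parameter descriptions, n = 0 is treated like n = 1 (an assumption), and the reduced sizes allowed below the minimum divide size are deliberately ignored. In practice a workspace query (lwork = -1) remains the reliable way to obtain the sizes.

```python
def lg(n: int) -> int:
    """Smallest integer k such that 2**k >= n, for n >= 1."""
    return (n - 1).bit_length()


def stedc_min_workspace(compz: str, n: int, complex_flavor: bool = False) -> dict:
    """Minimum lwork, lrwork and liwork for ?stedc per the stated formulas.

    Hypothetical helper for illustration only. lrwork is None for the real
    flavors (sstedc/dstedc), which have no rwork array.
    """
    if compz not in ("N", "I", "V"):
        raise ValueError("compz must be 'N', 'I' or 'V'")
    if n < 0:
        raise ValueError("n must be non-negative")

    # Assumption: n = 0 is handled like n = 1 (all sizes equal to 1).
    trivial = compz == "N" or n <= 1

    # Size of the real workspace: work for real flavors, rwork for complex ones.
    if trivial:
        real_size = 1
    elif compz == "V":
        real_size = 1 + 3 * n + 2 * n * lg(n) + 4 * n * n
    else:  # compz == "I"
        real_size = 1 + 4 * n + (2 * n * n if complex_flavor else n * n)

    # Integer workspace is the same for real and complex flavors.
    if trivial:
        liwork = 1
    elif compz == "V":
        liwork = 6 + 6 * n + 5 * n * lg(n)
    else:
        liwork = 3 + 5 * n

    if complex_flavor:
        # Complex work array is only needed when back-transforming (compz = 'V').
        lwork = n * n if (compz == "V" and n > 1) else 1
        return {"lwork": lwork, "lrwork": real_size, "liwork": liwork}
    return {"lwork": real_size, "lrwork": None, "liwork": liwork}


if __name__ == "__main__":
    print(stedc_min_workspace("V", 4))
    print(stedc_min_workspace("I", 4, complex_flavor=True))
```

For example, with compz = 'V' and n = 4, lg(n) = 2, so the real flavors need lwork of at least 1 + 12 + 16 + 64 = 93 and liwork of at least 6 + 24 + 40 = 70.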
Note that if lwork (liwork, lrwork) is less than the minimal required value and is not equal to -1, the routine returns immediately with an error exit and does not provide any information on the recommended workspace.

?stegr
Computes selected eigenvalues and eigenvectors of a real symmetric tridiagonal matrix.

Syntax
Fortran 77:
call sstegr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
call dstegr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
call cstegr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
call zstegr(jobz, range, n, d, e, vl, vu, il, iu, abstol, m, w, z, ldz, isuppz, work, lwork, iwork, liwork, info)
Fortran 95:
call rstegr(d, e, w [,z] [,vl] [,vu] [,il] [,iu] [,m] [,isuppz] [,abstol] [,info])
call stegr(d, e, w [,z] [,vl] [,vu] [,il] [,iu] [,m] [,isuppz] [,abstol] [,info])
C:
lapack_int LAPACKE_sstegr( int matrix_order, char jobz, char range, lapack_int n, float* d, float* e, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, float* z, lapack_int ldz, lapack_int* isuppz );
lapack_int LAPACKE_dstegr( int matrix_order, char jobz, char range, lapack_int n, double* d, double* e, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, double* z, lapack_int ldz, lapack_int* isuppz );
lapack_int LAPACKE_cstegr( int matrix_order, char jobz, char range, lapack_int n, float* d, float* e, float vl, float vu, lapack_int il, lapack_int iu, float abstol, lapack_int* m, float* w, lapack_complex_float* z, lapack_int ldz, lapack_int* isuppz );
lapack_int LAPACKE_zstegr( int matrix_order, char jobz, char range, lapack_int n, double* d, double* e, double vl, double vu, lapack_int il, lapack_int iu, double abstol, lapack_int* m, double* w, lapack_complex_double* z, lapack_int ldz, lapack_int* isuppz );

Include Files
• Fortran: mkl_lapack.fi and mkl_lapack.h
• Fortran 95: lapack.f90
• C: mkl_lapacke.h

Description
The routine computes selected eigenvalues and, optionally, eigenvectors of a real symmetric tridiagonal matrix T. Any such unreduced matrix has a well-defined set of pairwise different real eigenvalues; the corresponding real eigenvectors are pairwise orthogonal.
The spectrum may be computed either completely or partially by specifying either an interval (vl,vu] or a range of indices il:iu for the desired eigenvalues.
?stegr is a compatibility wrapper around the improved stemr routine. See its description for further details.
Note that the abstol parameter no longer provides any benefit and hence is no longer used.
See also auxiliary lasq2, lasq5, lasq6, used by this routine.

Input Parameters
The data types are given for the Fortran interface. A

: distributed matrix-matrix product, triangular matrix, double-precision complex.

PBLAS Level 1 Routines
PBLAS Level 1 includes routines and functions that perform distributed vector-vector operations. Table "PBLAS Level 1 Routine Groups and Their Data Types" lists the PBLAS Level 1 routine groups and the data types associated with them.
PBLAS Level 1 Routine Groups and Their Data Types

Routine or Function Group   Data Types          Description
p?amax                      s, d, c, z          Calculates an index of the distributed vector element with maximum absolute value
p?asum                      s, d, sc, dz        Calculates sum of magnitudes of a distributed vector
p?axpy                      s, d, c, z          Calculates distributed vector-scalar product
p?copy                      s, d, c, z          Copies a distributed vector
p?dot                       s, d                Calculates a dot product of two distributed real vectors
p?dotc                      c, z                Calculates a dot product of two distributed complex vectors, one of them is conjugated
p?dotu                      c, z                Calculates a dot product of two distributed complex vectors
p?nrm2                      s, d, sc, dz        Calculates the 2-norm (Euclidean norm) of a distributed vector
p?scal                      s, d, c, z, cs, zd  Calculates a product of a distributed vector by a scalar
p?swap                      s, d, c, z          Swaps two distributed vectors

p?amax
Computes the global index of the element of a distributed vector with maximum absolute value.

Syntax
call psamax(n, amax, indx, x, ix, jx, descx, incx)
call pdamax(n, amax, indx, x, ix, jx, descx, incx)
call pcamax(n, amax, indx, x, ix, jx, descx, incx)
call pzamax(n, amax, indx, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The functions p?amax compute the global index of the maximum element in absolute value of a distributed vector sub(x),
where sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
n (global) INTEGER. The length of distributed vector sub(x), n ≥ 0.

x (local) REAL for psamax
DOUBLE PRECISION for pdamax
COMPLEX for pcamax
DOUBLE COMPLEX for pzamax
Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)).
This array contains the entries of the distributed vector sub(x).

ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.

descx (global and local) INTEGER array of dimension 8.
The array descriptor of the distributed matrix X.

incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
amax (global) REAL for psamax.
DOUBLE PRECISION for pdamax.
COMPLEX for pcamax.
DOUBLE COMPLEX for pzamax.
Maximum absolute value (magnitude) of elements of the distributed vector only in its scope.

indx (global) INTEGER. The global index of the maximum element in absolute value of the distributed vector sub(x) only in its scope.

p?asum
Computes the sum of magnitudes of elements of a distributed vector.

Syntax
call psasum(n, asum, x, ix, jx, descx, incx)
call pscasum(n, asum, x, ix, jx, descx, incx)
call pdasum(n, asum, x, ix, jx, descx, incx)
call pdzasum(n, asum, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The functions p?asum compute the sum of the magnitudes of elements of a distributed vector sub(x),
where sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
n (global) INTEGER. The length of distributed vector sub(x), n ≥ 0.

x (local) REAL for psasum
DOUBLE PRECISION for pdasum
COMPLEX for pscasum
DOUBLE COMPLEX for pdzasum
Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)).
This array contains the entries of the distributed vector sub(x).

ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.

descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.

incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
asum (local) REAL for psasum and pscasum.
DOUBLE PRECISION for pdasum and pdzasum.
Contains the sum of magnitudes of elements of the distributed vector only in its scope.
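The sub(x) convention used by the PBLAS Level 1 routines can be made concrete with a small serial model. The Python sketch below is an illustration under stated assumptions, not PBLAS code: it works on an ordinary in-memory list of rows rather than a distributed matrix, the names subvector_indices and serial_asum are hypothetical, indices are 1-based as in the Fortran interface, only real data is handled, and when m_x = 1 (where both increments coincide) the row interpretation is chosen arbitrarily.

```python
def subvector_indices(ix: int, jx: int, n: int, incx: int, m_x: int) -> list:
    """Global 1-based (row, column) pairs that make up sub(x).

    incx = m_x selects the row vector    X(ix, jx:jx+n-1);
    incx = 1   selects the column vector X(ix:ix+n-1, jx).
    """
    if incx == m_x:
        # Row of X: fixed row ix, consecutive columns starting at jx.
        return [(ix, jx + k) for k in range(n)]
    if incx == 1:
        # Column of X: consecutive rows starting at ix, fixed column jx.
        return [(ix + k, jx) for k in range(n)]
    raise ValueError("incx must be 1 or m_x")


def serial_asum(X: list, ix: int, jx: int, n: int, incx: int) -> float:
    """Sum of magnitudes of sub(x) for a real matrix X stored as a list of rows.

    Serial stand-in for the quantity p?asum computes; no data distribution
    or process grid is modelled here.
    """
    m_x = len(X)  # number of rows of the global matrix X
    return sum(abs(X[i - 1][j - 1])
               for i, j in subvector_indices(ix, jx, n, incx, m_x))


if __name__ == "__main__":
    X = [[1, -2, 3, -4],
         [5, -6, 7, -8],
         [9, -10, 11, -12]]
    print(serial_asum(X, 2, 1, 4, 3))  # row 2 of X, since incx = m_x = 3
    print(serial_asum(X, 1, 2, 3, 1))  # column 2 of X, since incx = 1
```

With the 3-by-4 matrix in the usage section, the first call sums the magnitudes of row 2 and the second those of column 2.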
p?axpy
Computes a distributed vector-scalar product and adds the result to a distributed vector.

Syntax
call psaxpy(n, a, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pdaxpy(n, a, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pcaxpy(n, a, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzaxpy(n, a, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?axpy routines perform the following operation with distributed vectors:
sub(y) := sub(y) + a*sub(x)
where:
a is a scalar;
sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.

a (local) REAL for psaxpy
DOUBLE PRECISION for pdaxpy
COMPLEX for pcaxpy
DOUBLE COMPLEX for pzaxpy
Specifies the scalar a.

x (local) REAL for psaxpy
DOUBLE PRECISION for pdaxpy
COMPLEX for pcaxpy
DOUBLE COMPLEX for pzaxpy
Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)).
This array contains the entries of the distributed vector sub(x).

ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.

descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.

incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

y (local) REAL for psaxpy
DOUBLE PRECISION for pdaxpy
COMPLEX for pcaxpy
DOUBLE COMPLEX for pzaxpy
Array, DIMENSION ((jy-1)*m_y + iy + (n-1)*abs(incy)).
This array contains the entries of the distributed vector sub(y).

iy, jy (global) INTEGER.
The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.

descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.

incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by sub(y) := sub(y) + a*sub(x).

p?copy
Copies one distributed vector to another vector.

Syntax
call pscopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pdcopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pccopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzcopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call picopy(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?copy routines perform a copy operation with distributed vectors defined as
sub(y) = sub(x),
where sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.

x (local) REAL for pscopy
DOUBLE PRECISION for pdcopy
COMPLEX for pccopy
DOUBLE COMPLEX for pzcopy
INTEGER for picopy
Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)).
This array contains the entries of the distributed vector sub(x).

ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.

descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.

incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for pscopy
DOUBLE PRECISION for pdcopy
COMPLEX for pccopy
DOUBLE COMPLEX for pzcopy
INTEGER for picopy
Array, DIMENSION ((jy-1)*m_y + iy + (n-1)*abs(incy)).
This array contains the entries of the distributed vector sub(y).

iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.

descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.

incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten with the distributed vector sub(x).

p?dot
Computes the dot product of two distributed real vectors.

Syntax
call psdot(n, dot, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pddot(n, dot, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?dot functions compute the dot product dot of two distributed real vectors defined as
dot = sub(x)'*sub(y)
where sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.

x (local) REAL for psdot
DOUBLE PRECISION for pddot
Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)).
This array contains the entries of the distributed vector sub(x).

ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.

descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.

incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x.
incx must not be zero.
y (local) REAL for psdot, DOUBLE PRECISION for pddot. Array, DIMENSION ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
dot (local) REAL for psdot, DOUBLE PRECISION for pddot. Dot product of sub(x) and sub(y) only in their scope.

p?dotc
Computes the dot product of two distributed complex vectors, one of which is conjugated.

Syntax
call pcdotc(n, dotc, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzdotc(n, dotc, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?dotc functions compute the dot product dotc of two distributed vectors, one of which is conjugated:
dotc = conjg(sub(x)')*sub(y)
where sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.
x (local) COMPLEX for pcdotc, DOUBLE COMPLEX for pzdotc. Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x).
Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcdotc, DOUBLE COMPLEX for pzdotc. Array, DIMENSION ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
dotc (local) COMPLEX for pcdotc, DOUBLE COMPLEX for pzdotc. Dot product of sub(x) and sub(y) only in their scope.

p?dotu
Computes the dot product of two distributed complex vectors.

Syntax
call pcdotu(n, dotu, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzdotu(n, dotu, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?dotu functions compute the dot product dotu of two distributed vectors defined as
dotu = sub(x)'*sub(y)
where sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.
x (local) COMPLEX for pcdotu, DOUBLE COMPLEX for pzdotu. Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER.
Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcdotu, DOUBLE COMPLEX for pzdotu. Array, DIMENSION ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
dotu (local) COMPLEX for pcdotu, DOUBLE COMPLEX for pzdotu. Dot product of sub(x) and sub(y) only in their scope.

p?nrm2
Computes the Euclidean norm of a distributed vector.

Syntax
call psnrm2(n, norm2, x, ix, jx, descx, incx)
call pdnrm2(n, norm2, x, ix, jx, descx, incx)
call pscnrm2(n, norm2, x, ix, jx, descx, incx)
call pdznrm2(n, norm2, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The p?nrm2 functions compute the Euclidean norm of a distributed vector sub(x), where sub(x) is an n-element distributed vector.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
n (global) INTEGER. The length of distributed vector sub(x), n ≥ 0.
x (local) REAL for psnrm2, DOUBLE PRECISION for pdnrm2, COMPLEX for pscnrm2, DOUBLE COMPLEX for pdznrm2. Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER.
Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
norm2 (local) REAL for psnrm2 and pscnrm2, DOUBLE PRECISION for pdnrm2 and pdznrm2. Contains the Euclidean norm of a distributed vector only in its scope.

p?scal
Computes a product of a distributed vector by a scalar.

Syntax
call psscal(n, a, x, ix, jx, descx, incx)
call pdscal(n, a, x, ix, jx, descx, incx)
call pcscal(n, a, x, ix, jx, descx, incx)
call pzscal(n, a, x, ix, jx, descx, incx)
call pcsscal(n, a, x, ix, jx, descx, incx)
call pzdscal(n, a, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The p?scal routines multiply an n-element distributed vector sub(x) by the scalar a:
sub(x) = a*sub(x),
where sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
n (global) INTEGER. The length of distributed vector sub(x), n ≥ 0.
a (global) REAL for psscal and pcsscal, DOUBLE PRECISION for pdscal and pzdscal, COMPLEX for pcscal, DOUBLE COMPLEX for pzscal. Specifies the scalar a.
x (local) REAL for psscal, DOUBLE PRECISION for pdscal, COMPLEX for pcscal and pcsscal, DOUBLE COMPLEX for pzscal and pzdscal. Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
x Overwritten by the updated distributed vector sub(x).

p?swap
Swaps two distributed vectors.
Syntax
call psswap(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pdswap(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pcswap(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)
call pzswap(n, x, ix, jx, descx, incx, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
Given two distributed vectors sub(x) and sub(y), the p?swap routines return vectors sub(y) and sub(x) swapped, each replacing the other.
Here sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1;
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
n (global) INTEGER. The length of distributed vectors, n ≥ 0.
x (local) REAL for psswap, DOUBLE PRECISION for pdswap, COMPLEX for pcswap, DOUBLE COMPLEX for pzswap. Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(X), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for psswap, DOUBLE PRECISION for pdswap, COMPLEX for pcswap, DOUBLE COMPLEX for pzswap. Array, DIMENSION ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(Y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
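The sub(x) notation and the DIMENSION expression used in the parameter lists can be illustrated with a small serial sketch. It is not a PBLAS call: it assumes a global, non-distributed view in which X is stored column by column with m_x rows, and the helper names subvector_indices and swap_subvectors are hypothetical. Under that assumption, incx = 1 walks down a column, incx = m_x walks along a row, and the last position touched equals (jx-1)*m_x + ix + (n-1)*abs(incx).

```python
def subvector_indices(n, ix, jx, m_x, incx):
    """Return the 1-based positions of sub(x) inside a column-major array
    holding an m_x-row matrix X (assumed global, non-distributed view)."""
    if incx not in (1, m_x):
        raise ValueError("incx must be 1 or m_x")
    first = (jx - 1) * m_x + ix  # position of X(ix, jx)
    return [first + k * incx for k in range(n)]


def swap_subvectors(x, x_pos, y, y_pos):
    """Exchange the entries of x at positions x_pos with those of y at y_pos."""
    for px, py in zip(x_pos, y_pos):
        x[px - 1], y[py - 1] = y[py - 1], x[px - 1]


# X is 3-by-4, stored column by column, with X(i, j) = 10*i + j.
m_x = 3
X = [10 * i + j for j in range(1, 5) for i in range(1, 4)]
Y = [0] * 12  # Y is also 3-by-4, all zeros

col = subvector_indices(n=2, ix=2, jx=3, m_x=m_x, incx=1)    # X(2:3, 3)
row = subvector_indices(n=3, ix=2, jx=2, m_x=m_x, incx=m_x)  # X(2, 2:4)

# Swap the row segment X(2, 2:4) with the column segment Y(1:3, 1).
swap_subvectors(X, row, Y, subvector_indices(3, 1, 1, 3, 1))
```

In this sketch, max(row) is 11, which matches (2-1)*3 + 2 + (3-1)*3 from the DIMENSION expression.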
Output Parameters
x Overwritten by distributed vector sub(y).
y Overwritten by distributed vector sub(x).

PBLAS Level 2 Routines
This section describes PBLAS Level 2 routines, which perform distributed matrix-vector operations. Table "PBLAS Level 2 Routine Groups and Their Data Types" lists the PBLAS Level 2 routine groups and the data types associated with them.

PBLAS Level 2 Routine Groups and Their Data Types
Routine Groups  Data Types  Description
p?gemv          s, d, c, z  Matrix-vector product using a distributed general matrix
p?agemv         s, d, c, z  Matrix-vector product using absolute values for a distributed general matrix
p?ger           s, d        Rank-1 update of a distributed general matrix
p?gerc          c, z        Rank-1 update (conjugated) of a distributed general matrix
p?geru          c, z        Rank-1 update (unconjugated) of a distributed general matrix
p?hemv          c, z        Matrix-vector product using a distributed Hermitian matrix
p?ahemv         c, z        Matrix-vector product using absolute values for a distributed Hermitian matrix
p?her           c, z        Rank-1 update of a distributed Hermitian matrix
p?her2          c, z        Rank-2 update of a distributed Hermitian matrix
p?symv          s, d        Matrix-vector product using a distributed symmetric matrix
p?asymv         s, d        Matrix-vector product using absolute values for a distributed symmetric matrix
p?syr           s, d        Rank-1 update of a distributed symmetric matrix
p?syr2          s, d        Rank-2 update of a distributed symmetric matrix
p?trmv          s, d, c, z  Distributed matrix-vector product using a triangular matrix
p?atrmv         s, d, c, z  Distributed matrix-vector product using absolute values for a triangular matrix
p?trsv          s, d, c, z  Solves a system of linear equations whose coefficients are in a distributed triangular matrix

p?gemv
Computes a distributed matrix-vector product using a general matrix.
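As an informal companion to the p?gemv entry, the following serial sketch models the operation y := alpha*op(A)*x + beta*y, where op(A) is A, its transpose, or its conjugate transpose depending on trans. It is only a reference model on ordinary Python lists, not the distributed routine: the function name gemv_reference and the list-of-rows layout for A are assumptions made for illustration. Note how the lengths of x and y switch between n and m with trans, and how y is not read when beta is zero.

```python
def gemv_reference(trans, m, n, alpha, a, x, beta, y):
    """Serial model of y := alpha*op(A)*x + beta*y for an m-by-n matrix A
    given as a list of rows. Returns the updated vector as a new list."""
    t = trans.upper()
    if t not in ("N", "T", "C"):
        raise ValueError("trans must be 'N', 'T' or 'C'")
    if t == "N":
        rows, cols = m, n              # x has n entries, y has m entries
        entry = lambda i, j: a[i][j]
    elif t == "T":
        rows, cols = n, m              # x has m entries, y has n entries
        entry = lambda i, j: a[j][i]
    else:
        rows, cols = n, m
        entry = lambda i, j: complex(a[j][i]).conjugate()
    if len(x) != cols or len(y) != rows:
        raise ValueError("x or y has the wrong length for this trans")
    return [
        alpha * sum(entry(i, j) * x[j] for j in range(cols))
        + (beta * y[i] if beta != 0 else 0)   # y is not read when beta == 0
        for i in range(rows)
    ]


A = [[1, 2, 3],
     [4, 5, 6]]  # m = 2, n = 3

y_n = gemv_reference("N", 2, 3, 2, A, [1, 1, 1], 1, [10, 20])
y_t = gemv_reference("t", 2, 3, 1, A, [1, 1], 0, [None, None, None])
y_c = gemv_reference("C", 1, 1, 1, [[1j]], [1], 0, [0])
```

The second call passes placeholder values in y to show that, in this model, the input y is ignored when beta is zero.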
Syntax
call psgemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdgemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pcgemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzgemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?gemv routines perform a distributed matrix-vector operation defined as
sub(y) := alpha*sub(A)*sub(x) + beta*sub(y),
or
sub(y) := alpha*sub(A)'*sub(x) + beta*sub(y),
or
sub(y) := alpha*conjg(sub(A)')*sub(x) + beta*sub(y),
where alpha and beta are scalars, sub(A) is an m-by-n submatrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1), sub(x) and sub(y) are subvectors.
When trans = 'N' or 'n',
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+m-1) if incy = m_y, and Y(iy:iy+m-1, jy) if incy = 1.
When trans = 'T' or 't', or 'C', or 'c',
sub(x) denotes X(ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
trans (global) CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then sub(y) := alpha*sub(A)*sub(x) + beta*sub(y);
if trans = 'T' or 't', then sub(y) := alpha*sub(A)'*sub(x) + beta*sub(y);
if trans = 'C' or 'c', then sub(y) := alpha*conjg(sub(A)')*sub(x) + beta*sub(y).
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Specifies the scalar alpha.
a (local) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Array, DIMENSION (lld_a, LOCq(ja+n-1)).
Before entry this array must contain the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)) when trans = 'N' or 'n', and ((jx-1)*m_x + ix + (m-1)*abs(incx)) otherwise. This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Specifies the scalar beta. When beta is set to zero, then sub(y) need not be set on input.
y (local) REAL for psgemv, DOUBLE PRECISION for pdgemv, COMPLEX for pcgemv, DOUBLE COMPLEX for pzgemv. Array, DIMENSION ((jy-1)*m_y + iy + (m-1)*abs(incy)) when trans = 'N' or 'n', and ((jy-1)*m_y + iy + (n-1)*abs(incy)) otherwise. This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y.
incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?agemv
Computes a distributed matrix-vector product using absolute values for a general matrix.

Syntax
call psagemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdagemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pcagemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzagemv(trans, m, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?agemv routines perform a distributed matrix-vector operation defined as
sub(y) := abs(alpha)*abs(sub(A))*abs(sub(x)) + abs(beta*sub(y)),
or
sub(y) := abs(alpha)*abs(sub(A)')*abs(sub(x)) + abs(beta*sub(y)),
or
sub(y) := abs(alpha)*abs(conjg(sub(A)'))*abs(sub(x)) + abs(beta*sub(y)),
where alpha and beta are scalars, sub(A) is an m-by-n submatrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1), sub(x) and sub(y) are subvectors.
When trans = 'N' or 'n',
sub(x) denotes X(ix:ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx:jx) if incx = 1,
sub(y) denotes Y(iy:iy, jy:jy+m-1) if incy = m_y, and Y(iy:iy+m-1, jy:jy) if incy = 1.
When trans = 'T' or 't', or 'C', or 'c',
sub(x) denotes X(ix:ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx:jx) if incx = 1,
sub(y) denotes Y(iy:iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy:jy) if incy = 1.

Input Parameters
trans (global) CHARACTER*1. Specifies the operation:
if trans = 'N' or 'n', then sub(y) := |alpha|*|sub(A)|*|sub(x)| + |beta*sub(y)|;
if trans = 'T' or 't', then sub(y) := |alpha|*|sub(A)'|*|sub(x)| + |beta*sub(y)|;
if trans = 'C' or 'c', then sub(y) := |alpha|*|sub(A)'|*|sub(x)| + |beta*sub(y)|.
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER.
Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Specifies the scalar alpha.
a (local) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Array, DIMENSION ((jx-1)*m_x + ix + (n-1)*abs(incx)) when trans = 'N' or 'n', and ((jx-1)*m_x + ix + (m-1)*abs(incx)) otherwise. This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Specifies the scalar beta. When beta is set to zero, then sub(y) need not be set on input.
y (local) REAL for psagemv, DOUBLE PRECISION for pdagemv, COMPLEX for pcagemv, DOUBLE COMPLEX for pzagemv. Array, DIMENSION ((jy-1)*m_y + iy + (m-1)*abs(incy)) when trans = 'N' or 'n', and ((jy-1)*m_y + iy + (n-1)*abs(incy)) otherwise. This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER.
The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?ger
Performs a rank-1 update of a distributed general matrix.

Syntax
call psger(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pdger(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?ger routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*sub(y)' + sub(A),
where:
alpha is a scalar,
sub(A) is an m-by-n distributed general matrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1),
sub(x) is an m-element distributed vector, sub(y) is an n-element distributed vector,
sub(x) denotes X(ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psger, DOUBLE PRECISION for pdger. Specifies the scalar alpha.
x (local) REAL for psger, DOUBLE PRECISION for pdger. Array, DIMENSION at least ((jx-1)*m_x + ix + (m-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER.
Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for psger, DOUBLE PRECISION for pdger. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) REAL for psger, DOUBLE PRECISION for pdger. Array, DIMENSION (lld_a, LOCq(ja+n-1)). Before entry this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a Overwritten by the updated distributed matrix sub(A).

p?gerc
Performs a rank-1 update (conjugated) of a distributed general matrix.
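The difference between the conjugated and unconjugated rank-1 updates in the table of Level 2 routine groups can be seen in a short serial sketch. It assumes ordinary Python lists rather than distributed data, and the function name rank1_update and its conjugate_y flag are hypothetical. With conjugate_y set, each entry is updated as A(i,j) + alpha*x(i)*conjg(y(j)), in the style of p?gerc; without it, y(j) is used as is, in the style of p?geru.

```python
def rank1_update(m, n, alpha, x, y, a, conjugate_y):
    """Serial model of A := alpha*x*op(y)' + A for an m-by-n matrix A given
    as a list of rows; op(y) is conjg(y) when conjugate_y is True."""
    if len(x) != m or len(y) != n:
        raise ValueError("x must have m entries and y must have n entries")

    def op(v):
        return complex(v).conjugate() if conjugate_y else v

    return [
        [a[i][j] + alpha * x[i] * op(y[j]) for j in range(n)]
        for i in range(m)
    ]


x = [1, 2j]          # m = 2
y = [1j]             # n = 1
zero = [[0], [0]]    # 2-by-1 matrix of zeros

conjugated = rank1_update(2, 1, 1, x, y, zero, conjugate_y=True)
unconjugated = rank1_update(2, 1, 1, x, y, zero, conjugate_y=False)
```

For real data the two variants coincide, which is consistent with the real routines being grouped under a single p?ger entry.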
Syntax
call pcgerc(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pzgerc(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?gerc routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*conjg(sub(y)') + sub(A),
where:
alpha is a scalar,
sub(A) is an m-by-n distributed general matrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1),
sub(x) is an m-element distributed vector, sub(y) is an n-element distributed vector,
sub(x) denotes X(ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pcgerc, DOUBLE COMPLEX for pzgerc. Specifies the scalar alpha.
x (local) COMPLEX for pcgerc, DOUBLE COMPLEX for pzgerc. Array, DIMENSION at least ((jx-1)*m_x + ix + (m-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcgerc, DOUBLE COMPLEX for pzgerc. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) COMPLEX for pcgerc, DOUBLE COMPLEX for pzgerc. Array, DIMENSION at least (lld_a, LOCq(ja+n-1)). Before entry this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a Overwritten by the updated distributed matrix sub(A).

p?geru
Performs a rank-1 update (unconjugated) of a distributed general matrix.

Syntax
call pcgeru(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pzgeru(m, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?geru routines perform a matrix-vector operation defined as
sub(A) := alpha*sub(x)*sub(y)' + sub(A),
where:
alpha is a scalar,
sub(A) is an m-by-n distributed general matrix, sub(A) = A(ia:ia+m-1, ja:ja+n-1),
sub(x) is an m-element distributed vector, sub(y) is an n-element distributed vector,
sub(x) denotes X(ix, jx:jx+m-1) if incx = m_x, and X(ix:ix+m-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pcgeru, DOUBLE COMPLEX for pzgeru. Specifies the scalar alpha.
x (local) COMPLEX for pcgeru, DOUBLE COMPLEX for pzgeru. Array, DIMENSION at least ((jx-1)*m_x + ix + (m-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcgeru, DOUBLE COMPLEX for pzgeru. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) COMPLEX for pcgeru, DOUBLE COMPLEX for pzgeru. Array, DIMENSION at least (lld_a, LOCq(ja+n-1)). Before entry this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a Overwritten by the updated distributed matrix sub(A).

p?hemv
Computes a distributed matrix-vector product using a Hermitian matrix.
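A Hermitian matrix-vector product needs only one triangle of the matrix, because the other triangle follows by conjugation. The serial sketch below models y := alpha*A*x + beta*y under that idea. It is an illustration on plain Python lists, not the distributed routine: the name hemv_reference and the list-of-rows layout are assumptions, and entries outside the selected triangle are set to None to show that the model never reads them.

```python
def hemv_reference(uplo, n, alpha, a, x, beta, y):
    """Serial model of y := alpha*A*x + beta*y for an n-by-n Hermitian A,
    reading only the triangle of a (a list of rows) selected by uplo."""
    u = uplo.upper()
    if u not in ("U", "L"):
        raise ValueError("uplo must be 'U' or 'L'")
    upper = u == "U"

    def entry(i, j):
        stored = (i <= j) if upper else (i >= j)
        if stored:
            return a[i][j]
        # The unstored triangle is the conjugate of the mirrored entry.
        return complex(a[j][i]).conjugate()

    return [
        alpha * sum(entry(i, j) * x[j] for j in range(n))
        + (beta * y[i] if beta != 0 else 0)   # y is not read when beta == 0
        for i in range(n)
    ]


# The same Hermitian matrix [[2, 1-1j], [1+1j, 3]] stored two ways.
a_upper = [[2, 1 - 1j], [None, 3]]
a_lower = [[2, None], [1 + 1j, 3]]
x = [1, 1j]

y_upper = hemv_reference("U", 2, 1, a_upper, x, 0, [0, 0])
y_lower = hemv_reference("l", 2, 1, a_lower, x, 0, [0, 0])
```

Both calls produce the same vector, since the two storage variants describe the same Hermitian matrix.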
Syntax
call pchemv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzhemv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?hemv routines perform a distributed matrix-vector operation defined as
sub(y) := alpha*sub(A)*sub(x) + beta*sub(y),
where:
alpha and beta are scalars,
sub(A) is an n-by-n Hermitian distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Specifies the scalar alpha.
a (local) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A). Before entry when uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the Hermitian distributed matrix and the strictly lower triangular part of sub(A) is not referenced, and when uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the Hermitian distributed matrix and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER.
The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Specifies the scalar beta. When beta is set to zero, then sub(y) need not be set on input.
y (local) COMPLEX for pchemv, DOUBLE COMPLEX for pzhemv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?ahemv
Computes a distributed matrix-vector product using absolute values for a Hermitian matrix.
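The absolute-value variant replaces every operand by its modulus, so the result is a vector of non-negative real numbers. The serial sketch below models y := |alpha|*|A|*|x| + |beta*y| for a Hermitian matrix of which only one triangle is stored. It is an illustration on plain Python lists only: the name ahemv_reference and the list-of-rows layout are assumptions, not part of the library.

```python
def ahemv_reference(uplo, n, alpha, a, x, beta, y):
    """Serial model of y := |alpha|*|A|*|x| + |beta*y| for an n-by-n
    Hermitian A, reading only the triangle of a selected by uplo."""
    u = uplo.upper()
    if u not in ("U", "L"):
        raise ValueError("uplo must be 'U' or 'L'")
    upper = u == "U"

    def abs_entry(i, j):
        stored = (i <= j) if upper else (i >= j)
        # |conjg(z)| equals |z|, so the mirrored entry needs no conjugation.
        return abs(a[i][j]) if stored else abs(a[j][i])

    return [
        abs(alpha) * sum(abs_entry(i, j) * abs(x[j]) for j in range(n))
        + abs(beta * y[i])
        for i in range(n)
    ]


# Upper triangle of the Hermitian matrix [[2, 3-4j], [3+4j, 1]].
a_upper = [[2, 3 - 4j], [None, 1]]

result = ahemv_reference("U", 2, -1, a_upper, [1j, -2], 2, [1j, -1])
```

Here |A| is [[2, 5], [5, 1]] and |x| is [1, 2], so the two entries are 1*(2 + 10) + 2 and 1*(5 + 2) + 2.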
Syntax
call pcahemv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzahemv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?ahemv routines perform a distributed matrix-vector operation defined as
sub(y) := abs(alpha)*abs(sub(A))*abs(sub(x)) + abs(beta*sub(y)),
where:
alpha and beta are scalars,
sub(A) is an n-by-n Hermitian distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Specifies the scalar alpha.
a (local) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the Hermitian distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the Hermitian distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER.
The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Specifies the scalar beta. When beta is set to zero, sub(y) need not be set on input.
y (local) COMPLEX for pcahemv; DOUBLE COMPLEX for pzahemv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?her
Performs a rank-1 update of a distributed Hermitian matrix.
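The rank-1 update that p?her performs, sub(A) := alpha*sub(x)*conjg(sub(x)') + sub(A) with a real alpha, can be sketched serially in Python as follows. This is only an illustration under simplifying assumptions: no distribution, no descriptors, and nested lists for storage. The name her_reference is hypothetical and is not an MKL routine.

```python
def her_reference(uplo, alpha, x, a):
    """Serial sketch (hypothetical helper) of the Hermitian rank-1 update
    A := alpha * x * conjg(x') + A, with alpha real.

    Only the triangle of `a` selected by `uplo` is updated in place;
    the opposite strict triangle is left untouched.
    """
    n = len(x)
    upper = uplo in ("U", "u")
    for i in range(n):
        # Columns of row i that belong to the selected triangle.
        cols = range(i, n) if upper else range(0, i + 1)
        for j in cols:
            a[i][j] += alpha * x[i] * x[j].conjugate()
    return a
```

Because alpha is real, each diagonal update alpha*abs(x(i))**2 is real, so a Hermitian matrix stays Hermitian after the update.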
Syntax
call pcher(uplo, n, alpha, x, ix, jx, descx, incx, a, ia, ja, desca)
call pzher(uplo, n, alpha, x, ix, jx, descx, incx, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?her routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*conjg(sub(x)') + sub(A),
where:
alpha is a real scalar,
sub(A) is an n-by-n distributed Hermitian matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) is a distributed vector.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for pcher; DOUBLE PRECISION for pzher. Specifies the scalar alpha.
x (local) COMPLEX for pcher; DOUBLE COMPLEX for pzher. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
a (local) COMPLEX for pcher; DOUBLE COMPLEX for pzher. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the Hermitian distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the Hermitian distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated distributed matrix sub(A). With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated distributed matrix sub(A).

p?her2
Performs a rank-2 update of a distributed Hermitian matrix.

Syntax
call pcher2(uplo, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pzher2(uplo, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?her2 routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*conjg(sub(y)') + conjg(alpha)*sub(y)*conjg(sub(x)') + sub(A),
where:
alpha is a scalar,
sub(A) is an n-by-n distributed Hermitian matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
uplo (global) CHARACTER*1.
Specifies whether the upper or lower triangular part of the distributed Hermitian matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) COMPLEX for pcher2; DOUBLE COMPLEX for pzher2. Specifies the scalar alpha.
x (local) COMPLEX for pcher2; DOUBLE COMPLEX for pzher2. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) COMPLEX for pcher2; DOUBLE COMPLEX for pzher2. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) COMPLEX for pcher2; DOUBLE COMPLEX for pzher2. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the Hermitian distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the Hermitian distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated distributed matrix sub(A). With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated distributed matrix sub(A).

p?symv
Computes a distributed matrix-vector product using a symmetric matrix.

Syntax
call pssymv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdsymv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?symv routines perform a distributed matrix-vector operation defined as
sub(y) := alpha*sub(A)*sub(x) + beta*sub(y),
where:
alpha and beta are scalars,
sub(A) is an n-by-n symmetric distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
uplo (global) CHARACTER*1.
Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for pssymv; DOUBLE PRECISION for pdsymv. Specifies the scalar alpha.
a (local) REAL for pssymv; DOUBLE PRECISION for pdsymv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the symmetric distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the symmetric distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for pssymv; DOUBLE PRECISION for pdsymv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for pssymv; DOUBLE PRECISION for pdsymv. Specifies the scalar beta. When beta is set to zero, sub(y) need not be set on input.
y (local) REAL for pssymv; DOUBLE PRECISION for pdsymv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?asymv
Computes a distributed matrix-vector product using absolute values for a symmetric matrix.

Syntax
call psasymv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdasymv(uplo, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?asymv routines perform a distributed matrix-vector operation defined as
sub(y) := abs(alpha)*abs(sub(A))*abs(sub(x)) + abs(beta*sub(y)),
where:
alpha and beta are scalars,
sub(A) is an n-by-n symmetric distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER.
Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psasymv; DOUBLE PRECISION for pdasymv. Specifies the scalar alpha.
a (local) REAL for psasymv; DOUBLE PRECISION for pdasymv. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the symmetric distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the symmetric distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for psasymv; DOUBLE PRECISION for pdasymv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for psasymv; DOUBLE PRECISION for pdasymv. Specifies the scalar beta. When beta is set to zero, sub(y) need not be set on input.
y (local) REAL for psasymv; DOUBLE PRECISION for pdasymv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)).
This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?syr
Performs a rank-1 update of a distributed symmetric matrix.

Syntax
call pssyr(uplo, n, alpha, x, ix, jx, descx, incx, a, ia, ja, desca)
call pdsyr(uplo, n, alpha, x, ix, jx, descx, incx, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?syr routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*sub(x)' + sub(A),
where:
alpha is a scalar,
sub(A) is an n-by-n distributed symmetric matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) is a distributed vector.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for pssyr; DOUBLE PRECISION for pdsyr. Specifies the scalar alpha.
x (local) REAL for pssyr; DOUBLE PRECISION for pdsyr. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
a (local) REAL for pssyr; DOUBLE PRECISION for pdsyr. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the symmetric distributed matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the symmetric distributed matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated distributed matrix sub(A). With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated distributed matrix sub(A).

p?syr2
Performs a rank-2 update of a distributed symmetric matrix.
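The symmetric rank-2 update performed by p?syr2, sub(A) := alpha*sub(x)*sub(y)' + alpha*sub(y)*sub(x)' + sub(A), can be sketched serially in Python as below. The sketch assumes real data held in nested lists and leaves out distribution, descriptors, and increments. The name syr2_reference is hypothetical and is not an MKL routine.

```python
def syr2_reference(uplo, alpha, x, y, a):
    """Serial sketch (hypothetical helper) of the symmetric rank-2 update
    A := alpha*x*y' + alpha*y*x' + A.

    Only the triangle of `a` selected by `uplo` is updated in place.
    """
    n = len(x)
    upper = uplo in ("U", "u")
    for i in range(n):
        # Columns of row i that belong to the selected triangle.
        cols = range(i, n) if upper else range(0, i + 1)
        for j in cols:
            a[i][j] += alpha * (x[i] * y[j] + y[i] * x[j])
    return a
```

The update term is symmetric in i and j, which is why touching a single triangle is enough to represent the whole updated matrix.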
Syntax
call pssyr2(uplo, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)
call pdsyr2(uplo, n, alpha, x, ix, jx, descx, incx, y, iy, jy, descy, incy, a, ia, ja, desca)

Include Files
• C: mkl_pblas.h

Description
The p?syr2 routines perform a distributed matrix-vector operation defined as
sub(A) := alpha*sub(x)*sub(y)' + alpha*sub(y)*sub(x)' + sub(A),
where:
alpha is a scalar,
sub(A) is an n-by-n distributed symmetric matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the distributed symmetric matrix sub(A) is used:
If uplo = 'U' or 'u', then the upper triangular part of sub(A) is used.
If uplo = 'L' or 'l', then the lower triangular part of sub(A) is used.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for pssyr2; DOUBLE PRECISION for pdsyr2. Specifies the scalar alpha.
x (local) REAL for pssyr2; DOUBLE PRECISION for pdsyr2. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
y (local) REAL for pssyr2; DOUBLE PRECISION for pdsyr2. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER.
The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.
a (local) REAL for pssyr2; DOUBLE PRECISION for pdsyr2. Array, DIMENSION (lld_a, LOCq(ja+n-1)). This array contains the local pieces of the distributed matrix sub(A).
Before entry with uplo = 'U' or 'u', the n-by-n upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the distributed symmetric matrix, and the strictly lower triangular part of sub(A) is not referenced. With uplo = 'L' or 'l', the n-by-n lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the distributed symmetric matrix, and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.

Output Parameters
a With uplo = 'U' or 'u', the upper triangular part of the array a is overwritten by the upper triangular part of the updated distributed matrix sub(A). With uplo = 'L' or 'l', the lower triangular part of the array a is overwritten by the lower triangular part of the updated distributed matrix sub(A).

p?trmv
Computes a distributed matrix-vector product using a triangular matrix.
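The three forms of the p?trmv product, sub(x) := sub(A)*sub(x), sub(A)'*sub(x), or conjg(sub(A)')*sub(x), together with the effect of uplo and diag, can be sketched serially in Python as follows. This is a plain reference under simplifying assumptions (nested lists, no distribution or descriptors). The name trmv_reference is hypothetical and is not an MKL routine.

```python
def trmv_reference(uplo, trans, diag, a, x):
    """Serial sketch (hypothetical helper) of x := op(A) * x for a
    triangular matrix A, returning the new vector.

    Entries outside the triangle selected by `uplo` are never read, and
    with diag = 'U' the diagonal is not read but taken as 1.
    """
    n = len(x)
    upper = uplo in ("U", "u")
    unit = diag in ("U", "u")

    def tri(i, j):
        # Entry (i, j) of the triangular matrix, honoring uplo and diag.
        if i == j:
            return 1 if unit else a[i][i]
        in_triangle = (j > i) if upper else (j < i)
        return a[i][j] if in_triangle else 0

    def op(i, j):
        # Entry (i, j) of op(A) selected by trans.
        if trans in ("N", "n"):
            return tri(i, j)
        if trans in ("T", "t"):
            return tri(j, i)
        return tri(j, i).conjugate()  # 'C' or 'c'

    return [sum(op(i, j) * x[j] for j in range(n)) for i in range(n)]
```

Unlike the distributed routine, which overwrites sub(x), the sketch returns a new list so that the input stays available for comparison.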
Syntax
call pstrmv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pdtrmv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pctrmv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pztrmv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The p?trmv routines perform one of the following distributed matrix-vector operations defined as
sub(x) := sub(A)*sub(x), or
sub(x) := sub(A)'*sub(x), or
sub(x) := conjg(sub(A)')*sub(x),
where:
sub(A) is an n-by-n unit, or non-unit, upper or lower triangular distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) is an n-element distributed vector.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
trans (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix equation:
if trans = 'N' or 'n', then sub(x) := sub(A)*sub(x);
if trans = 'T' or 't', then sub(x) := sub(A)'*sub(x);
if trans = 'C' or 'c', then sub(x) := conjg(sub(A)')*sub(x).
diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
a (local) REAL for pstrmv; DOUBLE PRECISION for pdtrmv; COMPLEX for pctrmv; DOUBLE COMPLEX for pztrmv. Array, DIMENSION at least (lld_a, LOCq(1, ja+n-1)).
Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly lower triangular part of the distributed matrix sub(A) are not referenced.
Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly upper triangular part of the distributed matrix sub(A) are not referenced.
When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for pstrmv; DOUBLE PRECISION for pdtrmv; COMPLEX for pctrmv; DOUBLE COMPLEX for pztrmv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.

Output Parameters
x Overwritten by the transformed distributed vector sub(x).

p?atrmv
Computes a distributed matrix-vector product using absolute values for a triangular matrix.
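The absolute-value product computed by p?atrmv, sub(y) := abs(alpha)*abs(op(sub(A)))*abs(sub(x)) + abs(beta*sub(y)), can be sketched serially in Python as below. The sketch again assumes nested-list storage and no distribution. The name atrmv_reference is hypothetical and is not an MKL routine. Since conjugation does not change a modulus, the sketch treats trans = 'T' and trans = 'C' identically.

```python
def atrmv_reference(uplo, trans, diag, alpha, a, x, beta, y):
    """Serial sketch (hypothetical helper) of
    y := |alpha| * |op(A)| * |x| + |beta * y| for a triangular matrix A.
    """
    n = len(x)
    upper = uplo in ("U", "u")
    unit = diag in ("U", "u")
    transposed = trans not in ("N", "n")  # 'T' and 'C' give the same moduli

    def abs_tri(i, j):
        # |A(i, j)| for the triangular matrix, honoring uplo and diag.
        if i == j:
            return 1.0 if unit else abs(a[i][i])
        in_triangle = (j > i) if upper else (j < i)
        return abs(a[i][j]) if in_triangle else 0.0

    result = []
    for i in range(n):
        acc = sum(
            (abs_tri(j, i) if transposed else abs_tri(i, j)) * abs(x[j])
            for j in range(n)
        )
        result.append(abs(alpha) * acc + abs(beta * y[i]))
    return result
```

Every term in the sum is nonnegative, so the result is a componentwise upper bound on the magnitude of the corresponding ordinary product.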
Syntax
call psatrmv(uplo, trans, diag, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pdatrmv(uplo, trans, diag, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pcatrmv(uplo, trans, diag, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)
call pzatrmv(uplo, trans, diag, n, alpha, a, ia, ja, desca, x, ix, jx, descx, incx, beta, y, iy, jy, descy, incy)

Include Files
• C: mkl_pblas.h

Description
The p?atrmv routines perform one of the following distributed matrix-vector operations defined as
sub(y) := abs(alpha)*abs(sub(A))*abs(sub(x)) + abs(beta*sub(y)), or
sub(y) := abs(alpha)*abs(sub(A)')*abs(sub(x)) + abs(beta*sub(y)), or
sub(y) := abs(alpha)*abs(conjg(sub(A)'))*abs(sub(x)) + abs(beta*sub(y)),
where:
alpha and beta are scalars,
sub(A) is an n-by-n unit, or non-unit, upper or lower triangular distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
sub(x) and sub(y) are n-element distributed vectors.
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1,
sub(y) denotes Y(iy, jy:jy+n-1) if incy = m_y, and Y(iy:iy+n-1, jy) if incy = 1.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular:
if uplo = 'U' or 'u', then the matrix is upper triangular;
if uplo = 'L' or 'l', then the matrix is lower triangular.
trans (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix equation:
if trans = 'N' or 'n', then sub(y) := |alpha|*|sub(A)|*|sub(x)| + |beta*sub(y)|;
if trans = 'T' or 't', then sub(y) := |alpha|*|sub(A)'|*|sub(x)| + |beta*sub(y)|;
if trans = 'C' or 'c', then sub(y) := |alpha|*|conjg(sub(A)')|*|sub(x)| + |beta*sub(y)|.
diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular:
if diag = 'U' or 'u', then the matrix is unit triangular;
if diag = 'N' or 'n', then the matrix is not unit triangular.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
alpha (global) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Specifies the scalar alpha.
a (local) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Array, DIMENSION at least (lld_a, LOCq(1, ja+n-1)).
Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly lower triangular part of the distributed matrix sub(A) are not referenced.
Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the entries of the strictly upper triangular part of the distributed matrix sub(A) are not referenced.
When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
x (local) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x).
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
beta (global) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Specifies the scalar beta. When beta is set to zero, sub(y) need not be set on input.
y (local) REAL for psatrmv; DOUBLE PRECISION for pdatrmv; COMPLEX for pcatrmv; DOUBLE COMPLEX for pzatrmv. Array, DIMENSION at least ((jy-1)*m_y + iy + (n-1)*abs(incy)). This array contains the entries of the distributed vector sub(y).
iy, jy (global) INTEGER. The row and column indices in the distributed matrix Y indicating the first row and the first column of the submatrix sub(y), respectively.
descy (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix Y.
incy (global) INTEGER. Specifies the increment for the elements of sub(y). Only two values are supported, namely 1 and m_y. incy must not be zero.

Output Parameters
y Overwritten by the updated distributed vector sub(y).

p?trsv
Solves a system of linear equations whose coefficients are in a distributed triangular matrix.

Syntax
call pstrsv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pdtrsv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pctrsv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)
call pztrsv(uplo, trans, diag, n, a, ia, ja, desca, x, ix, jx, descx, incx)

Include Files
• C: mkl_pblas.h

Description
The p?trsv routines solve one of the systems of equations:
sub(A)*sub(x) = b, or
sub(A)'*sub(x) = b, or
conjg(sub(A)')*sub(x) = b,
where:
sub(A) is an n-by-n unit, or non-unit, upper or lower triangular distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+n-1),
b and sub(x) are n-element distributed vectors,
sub(x) denotes X(ix, jx:jx+n-1) if incx = m_x, and X(ix:ix+n-1, jx) if incx = 1.
The routine does not test for singularity or near-singularity.
Such tests must be performed before calling this routine.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans (global) CHARACTER*1. Specifies the form of the system of equations: if trans = 'N' or 'n', then sub(A)*sub(x) = b; if trans = 'T' or 't', then sub(A)'*sub(x) = b; if trans = 'C' or 'c', then conjg(sub(A)')*sub(x) = b.
diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
n (global) INTEGER. Specifies the order of the distributed matrix sub(A), n ≥ 0.
a (local) REAL for pstrsv, DOUBLE PRECISION for pdtrsv, COMPLEX for pctrsv, DOUBLE COMPLEX for pztrsv. Array, DIMENSION at least (lld_a, LOCq(1, ja+n-1)). Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the strictly lower triangular part of sub(A) are not referenced. Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the strictly upper triangular part of sub(A) are not referenced. When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8.
The array descriptor of the distributed matrix A.
x (local) REAL for pstrsv, DOUBLE PRECISION for pdtrsv, COMPLEX for pctrsv, DOUBLE COMPLEX for pztrsv. Array, DIMENSION at least ((jx-1)*m_x + ix + (n-1)*abs(incx)). This array contains the entries of the distributed vector sub(x). Before entry, sub(x) must contain the n-element right-hand side distributed vector b.
ix, jx (global) INTEGER. The row and column indices in the distributed matrix X indicating the first row and the first column of the submatrix sub(x), respectively.
descx (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix X.
incx (global) INTEGER. Specifies the increment for the elements of sub(x). Only two values are supported, namely 1 and m_x. incx must not be zero.
Output Parameters
x Overwritten with the solution vector.
PBLAS Level 3 Routines
The PBLAS Level 3 routines perform distributed matrix-matrix operations. Table "PBLAS Level 3 Routine Groups and Their Data Types" lists the PBLAS Level 3 routine groups and the data types associated with them.
PBLAS Level 3 Routine Groups and Their Data Types
Routine Group   Data Types   Description
p?geadd         s, d, c, z   Distributed matrix-matrix sum of general matrices
p?tradd         s, d, c, z   Distributed matrix-matrix sum of triangular matrices
p?gemm          s, d, c, z   Distributed matrix-matrix product of general matrices
p?hemm          c, z         Distributed matrix-matrix product, one matrix is Hermitian
p?herk          c, z         Rank-k update of a distributed Hermitian matrix
p?her2k         c, z         Rank-2k update of a distributed Hermitian matrix
p?symm          s, d, c, z   Matrix-matrix product of distributed symmetric matrices
p?syrk          s, d, c, z   Rank-k update of a distributed symmetric matrix
p?syr2k         s, d, c, z   Rank-2k update of a distributed symmetric matrix
p?tran          s, d         Transposition of a real distributed matrix
p?tranc         c, z         Transposition of a complex distributed matrix (conjugated)
p?tranu         c, z         Transposition of a complex distributed matrix
p?trmm          s, d, c, z   Distributed matrix-matrix product, one matrix is triangular
p?trsm          s, d, c, z   Solution of a distributed matrix equation, one matrix is triangular
p?geadd
Performs a sum operation for two distributed general matrices.
Syntax
call psgeadd(trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pdgeadd(trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pcgeadd(trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pzgeadd(trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?geadd routines perform a sum operation for two distributed general matrices. The operation is defined as sub(C) := beta*sub(C) + alpha*op(sub(A)), where: op(x) is one of op(x) = x, or op(x) = x', alpha and beta are scalars, sub(C) is an m-by-n distributed matrix, sub(C) = C(ic:ic+m-1, jc:jc+n-1), and sub(A) is a distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+m-1).
Input Parameters
trans (global) CHARACTER*1.
Specifies the operation: if trans = 'N' or 'n', then op(sub(A)) := sub(A); if trans = 'T' or 't', then op(sub(A)) := sub(A)'; if trans = 'C' or 'c', then op(sub(A)) := sub(A)'.
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C) and the number of columns of the submatrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(C) and the number of rows of the submatrix sub(A), n ≥ 0.
alpha (global) REAL for psgeadd, DOUBLE PRECISION for pdgeadd, COMPLEX for pcgeadd, DOUBLE COMPLEX for pzgeadd. Specifies the scalar alpha.
a (local) REAL for psgeadd, DOUBLE PRECISION for pdgeadd, COMPLEX for pcgeadd, DOUBLE COMPLEX for pzgeadd. Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for psgeadd, DOUBLE PRECISION for pdgeadd, COMPLEX for pcgeadd, DOUBLE COMPLEX for pzgeadd. Specifies the scalar beta. When beta is equal to zero, then sub(C) need not be set on input.
c (local) REAL for psgeadd, DOUBLE PRECISION for pdgeadd, COMPLEX for pcgeadd, DOUBLE COMPLEX for pzgeadd. Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c Overwritten by the updated submatrix.
p?tradd
Performs a sum operation for two distributed triangular matrices.
Syntax
call pstradd(uplo, trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pdtradd(uplo, trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pctradd(uplo, trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pztradd(uplo, trans, m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?tradd routines perform a sum operation for two distributed triangular matrices. The operation is defined as sub(C) := beta*sub(C) + alpha*op(sub(A)), where: op(x) is one of op(x) = x, or op(x) = x', or op(x) = conjg(x'), alpha and beta are scalars, sub(C) is an m-by-n distributed matrix, sub(C) = C(ic:ic+m-1, jc:jc+n-1), and sub(A) is a distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+m-1).
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(C) is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
trans (global) CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then op(sub(A)) := sub(A); if trans = 'T' or 't', then op(sub(A)) := sub(A)'; if trans = 'C' or 'c', then op(sub(A)) := conjg(sub(A)').
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C) and the number of columns of the submatrix sub(A), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(C) and the number of rows of the submatrix sub(A), n ≥ 0.
alpha (global) REAL for pstradd, DOUBLE PRECISION for pdtradd, COMPLEX for pctradd, DOUBLE COMPLEX for pztradd. Specifies the scalar alpha.
a (local) REAL for pstradd, DOUBLE PRECISION for pdtradd, COMPLEX for pctradd, DOUBLE COMPLEX for pztradd. Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER.
The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for pstradd, DOUBLE PRECISION for pdtradd, COMPLEX for pctradd, DOUBLE COMPLEX for pztradd. Specifies the scalar beta. When beta is equal to zero, then sub(C) need not be set on input.
c (local) REAL for pstradd, DOUBLE PRECISION for pdtradd, COMPLEX for pctradd, DOUBLE COMPLEX for pztradd. Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c Overwritten by the updated submatrix.
p?gemm
Computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product for distributed matrices.
Syntax
call psgemm(transa, transb, m, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pdgemm(transa, transb, m, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pcgemm(transa, transb, m, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzgemm(transa, transb, m, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?gemm routines perform a matrix-matrix operation with general distributed matrices.
The operation is defined as sub(C) := alpha*op(sub(A))*op(sub(B)) + beta*sub(C), where: op(x) is one of op(x) = x, or op(x) = x', alpha and beta are scalars, and sub(A) = A(ia:ia+m-1, ja:ja+k-1), sub(B) = B(ib:ib+k-1, jb:jb+n-1), and sub(C) = C(ic:ic+m-1, jc:jc+n-1) are distributed matrices.
Input Parameters
transa (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix multiplication: if transa = 'N' or 'n', then op(sub(A)) = sub(A); if transa = 'T' or 't', then op(sub(A)) = sub(A)'; if transa = 'C' or 'c', then op(sub(A)) = sub(A)'.
transb (global) CHARACTER*1. Specifies the form of op(sub(B)) used in the matrix multiplication: if transb = 'N' or 'n', then op(sub(B)) = sub(B); if transb = 'T' or 't', then op(sub(B)) = sub(B)'; if transb = 'C' or 'c', then op(sub(B)) = sub(B)'.
m (global) INTEGER. Specifies the number of rows of the distributed matrices op(sub(A)) and sub(C), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrices op(sub(B)) and sub(C), n ≥ 0.
k (global) INTEGER. Specifies the number of columns of the distributed matrix op(sub(A)) and the number of rows of the distributed matrix op(sub(B)), k ≥ 0.
alpha (global) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Specifies the scalar alpha. When alpha is equal to zero, then the local entries of the arrays a and b corresponding to the entries of the submatrices sub(A) and sub(B) respectively need not be set on input.
a (local) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Array, DIMENSION (lld_a, kla), where kla is LOCc(ja+k-1) when transa = 'N' or 'n', and is LOCq(ja+m-1) otherwise. Before entry this array must contain the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER.
The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Array, DIMENSION (lld_b, klb), where klb is LOCc(jb+n-1) when transb = 'N' or 'n', and is LOCq(jb+k-1) otherwise. Before entry this array must contain the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Specifies the scalar beta. When beta is equal to zero, then sub(C) need not be set on input.
c (local) REAL for psgemm, DOUBLE PRECISION for pdgemm, COMPLEX for pcgemm, DOUBLE COMPLEX for pzgemm. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c Overwritten by the m-by-n distributed matrix alpha*op(sub(A))*op(sub(B)) + beta*sub(C).
p?hemm
Performs a scalar-matrix-matrix product (one matrix operand is Hermitian) and adds the result to a scalar-matrix product.
Syntax
call pchemm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzhemm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?hemm routines perform a matrix-matrix operation with distributed matrices. The operation is defined as sub(C) := alpha*sub(A)*sub(B) + beta*sub(C), or sub(C) := alpha*sub(B)*sub(A) + beta*sub(C), where: alpha and beta are scalars, sub(A) is a Hermitian distributed matrix, sub(A) = A(ia:ia+m-1, ja:ja+m-1) if side = 'L', and sub(A) = A(ia:ia+n-1, ja:ja+n-1) if side = 'R'. sub(B) and sub(C) are m-by-n distributed matrices, sub(B) = B(ib:ib+m-1, jb:jb+n-1), sub(C) = C(ic:ic+m-1, jc:jc+n-1).
Input Parameters
side (global) CHARACTER*1. Specifies whether the Hermitian distributed matrix sub(A) appears on the left or right in the operation: if side = 'L' or 'l', then sub(C) := alpha*sub(A)*sub(B) + beta*sub(C); if side = 'R' or 'r', then sub(C) := alpha*sub(B)*sub(A) + beta*sub(C).
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(A) is used: if uplo = 'U' or 'u', then the upper triangular part is used; if uplo = 'L' or 'l', then the lower triangular part is used.
m (global) INTEGER. Specifies the number of rows of the distributed submatrix sub(C), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed submatrix sub(C), n ≥ 0.
alpha (global) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Specifies the scalar alpha.
a (local) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Array, DIMENSION (lld_a, LOCq(ja+na-1)).
Before entry this array must contain the local pieces of the Hermitian distributed matrix sub(A), where na is m when side = 'L' or 'l' and n when side = 'R' or 'r', such that when uplo = 'U' or 'u', the na-by-na upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the Hermitian distributed matrix and the strictly lower triangular part of sub(A) is not referenced, and when uplo = 'L' or 'l', the na-by-na lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the Hermitian distributed matrix and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Array, DIMENSION (lld_b, LOCq(jb+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Specifies the scalar beta. When beta is set to zero, then sub(C) need not be set on input.
c (local) COMPLEX for pchemm, DOUBLE COMPLEX for pzhemm. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
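The roles of side and uplo in p?hemm can be illustrated with a small serial sketch. The Python below is not the MKL interface: it runs in a single process, ignores the distribution arguments (ia, ja, desca and so on), and the names hemm_reference, full_hermitian, and matmul are hypothetical helpers introduced only for this illustration. It shows that only the triangle selected by uplo is read, with the other triangle implied by conjugate symmetry.

```python
# Serial, single-process reference for the p?hemm operation (illustration only;
# hemm_reference and its helpers are hypothetical names, not MKL routines).

def full_hermitian(a, uplo):
    """Rebuild the whole Hermitian matrix from the triangle selected by uplo."""
    n = len(a)
    upper = uplo.upper() == "U"
    full = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            stored = (j >= i) if upper else (j <= i)
            # Entries outside the selected triangle are never read from a;
            # they are taken as the conjugate of the mirrored entry.
            full[i][j] = a[i][j] if stored else a[j][i].conjugate()
    return full


def matmul(x, y):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(x[i][p] * y[p][j] for p in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def hemm_reference(side, uplo, alpha, a, b, beta, c):
    """Return alpha*A*B + beta*C (side 'L') or alpha*B*A + beta*C (side 'R')."""
    h = full_hermitian(a, uplo)
    prod = matmul(h, b) if side.upper() == "L" else matmul(b, h)
    return [[alpha * prod[i][j] + beta * c[i][j] for j in range(len(c[0]))]
            for i in range(len(c))]


# 2-by-2 Hermitian A stored in its upper triangle; 999 marks an unreferenced slot.
a = [[2, 1 + 1j],
     [999, 3]]
b = [[1], [1j]]          # m = 2, n = 1
c = [[0], [0]]
c_new = hemm_reference("L", "U", 1, a, b, 0, c)
print(c_new)
```

Changing the value stored in the unreferenced slot of a leaves c_new unchanged, which mirrors the statement that the strictly lower (or upper) triangular part of sub(A) is not referenced.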
Output Parameters
c Overwritten by the m-by-n updated distributed matrix.
p?herk
Performs a rank-k update of a distributed Hermitian matrix.
Syntax
call pcherk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pzherk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?herk routines perform a distributed matrix-matrix operation defined as sub(C) := alpha*sub(A)*conjg(sub(A)') + beta*sub(C), or sub(C) := alpha*conjg(sub(A)')*sub(A) + beta*sub(C), where: alpha and beta are scalars, sub(C) is an n-by-n Hermitian distributed matrix, sub(C) = C(ic:ic+n-1, jc:jc+n-1), and sub(A) is a distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+k-1) if trans = 'N' or 'n', and sub(A) = A(ia:ia+k-1, ja:ja+n-1) otherwise.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(C) is used: if uplo = 'U' or 'u', then the upper triangular part of sub(C) is used; if uplo = 'L' or 'l', then the lower triangular part of sub(C) is used.
trans (global) CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then sub(C) := alpha*sub(A)*conjg(sub(A)') + beta*sub(C); if trans = 'C' or 'c', then sub(C) := alpha*conjg(sub(A)')*sub(A) + beta*sub(C).
n (global) INTEGER. Specifies the order of the distributed matrix sub(C), n ≥ 0.
k (global) INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the distributed matrix sub(A), and on entry with trans = 'C' or 'c', k specifies the number of rows of the distributed matrix sub(A), k ≥ 0.
alpha (global) REAL for pcherk, DOUBLE PRECISION for pzherk. Specifies the scalar alpha.
a (local) COMPLEX for pcherk, DOUBLE COMPLEX for pzherk. Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when trans = 'N' or 'n', and is LOCq(ja+n-1) otherwise.
Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for pcherk, DOUBLE PRECISION for pzherk. Specifies the scalar beta.
c (local) COMPLEX for pcherk, DOUBLE COMPLEX for pzherk. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry with uplo = 'U' or 'u', this array contains the n-by-n upper triangular part of the Hermitian distributed matrix sub(C) and its strictly lower triangular part is not referenced. Before entry with uplo = 'L' or 'l', this array contains the n-by-n lower triangular part of the Hermitian distributed matrix sub(C) and its strictly upper triangular part is not referenced.
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of sub(C) is overwritten by the upper triangular part of the updated distributed matrix. With uplo = 'L' or 'l', the lower triangular part of sub(C) is overwritten by the lower triangular part of the updated distributed matrix.
p?her2k
Performs a rank-2k update of a Hermitian distributed matrix.
Syntax
Fortran 77:
call pcher2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzher2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?her2k routines perform a distributed matrix-matrix operation defined as sub(C) := alpha*sub(A)*conjg(sub(B)') + conjg(alpha)*sub(B)*conjg(sub(A)') + beta*sub(C), or sub(C) := alpha*conjg(sub(A)')*sub(B) + conjg(alpha)*conjg(sub(B)')*sub(A) + beta*sub(C), where: alpha and beta are scalars, sub(C) is an n-by-n Hermitian distributed matrix, sub(C) = C(ic:ic+n-1, jc:jc+n-1). sub(A) is a distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+k-1) if trans = 'N' or 'n', and sub(A) = A(ia:ia+k-1, ja:ja+n-1) otherwise. sub(B) is a distributed matrix, sub(B) = B(ib:ib+n-1, jb:jb+k-1) if trans = 'N' or 'n', and sub(B) = B(ib:ib+k-1, jb:jb+n-1) otherwise.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the Hermitian distributed matrix sub(C) is used: if uplo = 'U' or 'u', then the upper triangular part of sub(C) is used; if uplo = 'L' or 'l', then the lower triangular part of sub(C) is used.
trans (global) CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then sub(C) := alpha*sub(A)*conjg(sub(B)') + conjg(alpha)*sub(B)*conjg(sub(A)') + beta*sub(C); if trans = 'C' or 'c', then sub(C) := alpha*conjg(sub(A)')*sub(B) + conjg(alpha)*conjg(sub(B)')*sub(A) + beta*sub(C).
n (global) INTEGER. Specifies the order of the distributed matrix sub(C), n ≥ 0.
k (global) INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the distributed matrices sub(A) and sub(B), and on entry with trans = 'C' or 'c', k specifies the number of rows of the distributed matrices sub(A) and sub(B), k ≥ 0.
alpha (global) COMPLEX for pcher2k, DOUBLE COMPLEX for pzher2k. Specifies the scalar alpha.
a (local) COMPLEX for pcher2k, DOUBLE COMPLEX for pzher2k. Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when trans = 'N' or 'n', and is LOCq(ja+n-1) otherwise. Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) COMPLEX for pcher2k, DOUBLE COMPLEX for pzher2k. Array, DIMENSION (lld_b, klb), where klb is LOCq(jb+k-1) when trans = 'N' or 'n', and is LOCq(jb+n-1) otherwise. Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) REAL for pcher2k, DOUBLE PRECISION for pzher2k. Specifies the scalar beta.
c (local) COMPLEX for pcher2k, DOUBLE COMPLEX for pzher2k. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry with uplo = 'U' or 'u', this array contains the n-by-n upper triangular part of the Hermitian distributed matrix sub(C) and its strictly lower triangular part is not referenced. Before entry with uplo = 'L' or 'l', this array contains the n-by-n lower triangular part of the Hermitian distributed matrix sub(C) and its strictly upper triangular part is not referenced.
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
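As a rough illustration of the rank-2k update, the serial Python sketch below evaluates the operation on ordinary nested lists. It assumes the standard BLAS her2k definition (for trans = 'C', alpha*conjg(A')*B + conjg(alpha)*conjg(B')*A + beta*C), runs in one process, and ignores the distribution arguments; her2k_reference, conj_transpose, and matmul are hypothetical helper names, not MKL routines. Only the triangle named by uplo is updated, and the other triangle is left untouched.

```python
# Serial, single-process reference for the p?her2k operation (illustration only;
# her2k_reference and its helpers are hypothetical names, not MKL routines).

def conj_transpose(x):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[x[i][j].conjugate() for i in range(len(x))]
            for j in range(len(x[0]))]


def matmul(x, y):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(x[i][p] * y[p][j] for p in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def her2k_reference(uplo, trans, alpha, a, b, beta, c):
    """Update only the uplo triangle of the n-by-n matrix c and return a copy."""
    if trans.upper() == "N":
        t1 = matmul(a, conj_transpose(b))      # A * B^H
        t2 = matmul(b, conj_transpose(a))      # B * A^H
    else:
        t1 = matmul(conj_transpose(a), b)      # A^H * B
        t2 = matmul(conj_transpose(b), a)      # B^H * A
    n = len(c)
    upper = uplo.upper() == "U"
    out = [row[:] for row in c]
    for i in range(n):
        for j in range(n):
            if (j >= i) if upper else (j <= i):
                out[i][j] = (alpha * t1[i][j]
                             + alpha.conjugate() * t2[i][j]
                             + beta * c[i][j])
    return out


# n = 2, k = 1; the -7 in the lower triangle of c must survive with uplo = 'U'.
a = [[1], [1j]]
b = [[2], [1]]
c = [[1, 0],
     [-7, 2]]
c_new = her2k_reference("U", "N", 1j, a, b, 1.0, c)
print(c_new)
```

Because alpha multiplies one term and conjg(alpha) the other, the full update is Hermitian, which is why storing a single triangle is sufficient and why beta is real.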
Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of sub(C) is overwritten by the upper triangular part of the updated distributed matrix. With uplo = 'L' or 'l', the lower triangular part of sub(C) is overwritten by the lower triangular part of the updated distributed matrix.
p?symm
Performs a scalar-matrix-matrix product (one matrix operand is symmetric) and adds the result to a scalar-matrix product for distributed matrices.
Syntax
Fortran 77:
call pssymm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pdsymm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pcsymm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzsymm(side, uplo, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?symm routines perform a matrix-matrix operation with distributed matrices. The operation is defined as sub(C) := alpha*sub(A)*sub(B) + beta*sub(C), or sub(C) := alpha*sub(B)*sub(A) + beta*sub(C), where: alpha and beta are scalars, sub(A) is a symmetric distributed matrix, sub(A) = A(ia:ia+m-1, ja:ja+m-1) if side = 'L', and sub(A) = A(ia:ia+n-1, ja:ja+n-1) if side = 'R'. sub(B) and sub(C) are m-by-n distributed matrices, sub(B) = B(ib:ib+m-1, jb:jb+n-1), sub(C) = C(ic:ic+m-1, jc:jc+n-1).
Input Parameters
side (global) CHARACTER*1. Specifies whether the symmetric distributed matrix sub(A) appears on the left or right in the operation: if side = 'L' or 'l', then sub(C) := alpha*sub(A)*sub(B) + beta*sub(C); if side = 'R' or 'r', then sub(C) := alpha*sub(B)*sub(A) + beta*sub(C).
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(A) is used: if uplo = 'U' or 'u', then the upper triangular part is used; if uplo = 'L' or 'l', then the lower triangular part is used.
m (global) INTEGER.
Specifies the number of rows of the distributed submatrix sub(C), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed submatrix sub(C), n ≥ 0.
alpha (global) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Specifies the scalar alpha.
a (local) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Array, DIMENSION (lld_a, LOCq(ja+na-1)), where na is m when side = 'L' or 'l' and n when side = 'R' or 'r'. Before entry this array must contain the local pieces of the symmetric distributed matrix sub(A), such that when uplo = 'U' or 'u', the na-by-na upper triangular part of the distributed matrix sub(A) must contain the upper triangular part of the symmetric distributed matrix and the strictly lower triangular part of sub(A) is not referenced, and when uplo = 'L' or 'l', the na-by-na lower triangular part of the distributed matrix sub(A) must contain the lower triangular part of the symmetric distributed matrix and the strictly upper triangular part of sub(A) is not referenced.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Array, DIMENSION (lld_b, LOCq(jb+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Specifies the scalar beta.
When beta is set to zero, then sub(C) need not be set on input.
c (local) REAL for pssymm, DOUBLE PRECISION for pdsymm, COMPLEX for pcsymm, DOUBLE COMPLEX for pzsymm. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry this array must contain the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.
Output Parameters
c Overwritten by the m-by-n updated matrix.
p?syrk
Performs a rank-k update of a symmetric distributed matrix.
Syntax
call pssyrk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pdsyrk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pcsyrk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pzsyrk(uplo, trans, n, k, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
Include Files
• C: mkl_pblas.h
Description
The p?syrk routines perform a distributed matrix-matrix operation defined as sub(C) := alpha*sub(A)*sub(A)' + beta*sub(C), or sub(C) := alpha*sub(A)'*sub(A) + beta*sub(C), where: alpha and beta are scalars, sub(C) is an n-by-n symmetric distributed matrix, sub(C) = C(ic:ic+n-1, jc:jc+n-1), and sub(A) is a distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+k-1) if trans = 'N' or 'n', and sub(A) = A(ia:ia+k-1, ja:ja+n-1) otherwise.
Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(C) is used: if uplo = 'U' or 'u', then the upper triangular part of sub(C) is used; if uplo = 'L' or 'l', then the lower triangular part of sub(C) is used.
trans (global) CHARACTER*1. Specifies the operation: if trans = 'N' or 'n', then sub(C) := alpha*sub(A)*sub(A)' + beta*sub(C); if trans = 'T' or 't', then sub(C) := alpha*sub(A)'*sub(A) + beta*sub(C).
n (global) INTEGER. Specifies the order of the distributed matrix sub(C), n ≥ 0.
k (global) INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the distributed matrix sub(A), and on entry with trans = 'T' or 't', k specifies the number of rows of the distributed matrix sub(A), k ≥ 0.
alpha (global) REAL for pssyrk, DOUBLE PRECISION for pdsyrk, COMPLEX for pcsyrk, DOUBLE COMPLEX for pzsyrk. Specifies the scalar alpha.
a (local) REAL for pssyrk, DOUBLE PRECISION for pdsyrk, COMPLEX for pcsyrk, DOUBLE COMPLEX for pzsyrk. Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when trans = 'N' or 'n', and is LOCq(ja+n-1) otherwise. Before entry with trans = 'N' or 'n', this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for pssyrk, DOUBLE PRECISION for pdsyrk, COMPLEX for pcsyrk, DOUBLE COMPLEX for pzsyrk. Specifies the scalar beta.
c (local) REAL for pssyrk, DOUBLE PRECISION for pdsyrk, COMPLEX for pcsyrk, DOUBLE COMPLEX for pzsyrk. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry with uplo = 'U' or 'u', this array contains the n-by-n upper triangular part of the symmetric distributed matrix sub(C) and its strictly lower triangular part is not referenced. Before entry with uplo = 'L' or 'l', this array contains the n-by-n lower triangular part of the symmetric distributed matrix sub(C) and its strictly upper triangular part is not referenced.
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8.
The array descriptor of the distributed matrix C.

Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of sub(C) is overwritten by the upper triangular part of the updated distributed matrix. With uplo = 'L' or 'l', the lower triangular part of sub(C) is overwritten by the lower triangular part of the updated distributed matrix.

p?syr2k
Performs a rank-2k update of a symmetric distributed matrix.

Syntax
call pssyr2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pdsyr2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pcsyr2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)
call pzsyr2k(uplo, trans, n, k, alpha, a, ia, ja, desca, b, ib, jb, descb, beta, c, ic, jc, descc)

Include Files
• C: mkl_pblas.h

Description
The p?syr2k routines perform a distributed matrix-matrix operation defined as
sub(C) := alpha*sub(A)*sub(B)' + alpha*sub(B)*sub(A)' + beta*sub(C),
or
sub(C) := alpha*sub(A)'*sub(B) + alpha*sub(B)'*sub(A) + beta*sub(C),
where:
alpha and beta are scalars,
sub(C) is an n-by-n symmetric distributed matrix, sub(C) = C(ic:ic+n-1, jc:jc+n-1),
sub(A) is a distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+k-1) if trans = 'N' or 'n', and sub(A) = A(ia:ia+k-1, ja:ja+n-1) otherwise,
sub(B) is a distributed matrix, sub(B) = B(ib:ib+n-1, jb:jb+k-1) if trans = 'N' or 'n', and sub(B) = B(ib:ib+k-1, jb:jb+n-1) otherwise.

Input Parameters
uplo (global) CHARACTER*1. Specifies whether the upper or lower triangular part of the symmetric distributed matrix sub(C) is used: if uplo = 'U' or 'u', then the upper triangular part of sub(C) is used; if uplo = 'L' or 'l', then the lower triangular part of sub(C) is used.
trans (global) CHARACTER*1.
Specifies the operation: if trans = 'N' or 'n', then sub(C) := alpha*sub(A)*sub(B)' + alpha*sub(B)*sub(A)' + beta*sub(C); if trans = 'T' or 't', then sub(C) := alpha*sub(A)'*sub(B) + alpha*sub(B)'*sub(A) + beta*sub(C).
n (global) INTEGER. Specifies the order of the distributed matrix sub(C), n ≥ 0.
k (global) INTEGER. On entry with trans = 'N' or 'n', k specifies the number of columns of the distributed matrices sub(A) and sub(B), and on entry with trans = 'T' or 't', k specifies the number of rows of the distributed matrices sub(A) and sub(B), k ≥ 0.
alpha (global) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k. Specifies the scalar alpha.
a (local) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k. Array, DIMENSION (lld_a, kla), where kla is LOCq(ja+k-1) when trans = 'N' or 'n', and is LOCq(ja+n-1) otherwise. Before entry, this array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k. Array, DIMENSION (lld_b, klb), where klb is LOCq(jb+k-1) when trans = 'N' or 'n', and is LOCq(jb+n-1) otherwise. Before entry, this array contains the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.
beta (global) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k. Specifies the scalar beta.
c (local) REAL for pssyr2k, DOUBLE PRECISION for pdsyr2k, COMPLEX for pcsyr2k, DOUBLE COMPLEX for pzsyr2k. Array, DIMENSION (lld_c, LOCq(jc+n-1)). Before entry with uplo = 'U' or 'u', this array contains the n-by-n upper triangular part of the symmetric distributed matrix sub(C), and its strictly lower triangular part is not referenced. Before entry with uplo = 'L' or 'l', this array contains the n-by-n lower triangular part of the symmetric distributed matrix sub(C), and its strictly upper triangular part is not referenced.
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.

Output Parameters
c With uplo = 'U' or 'u', the upper triangular part of sub(C) is overwritten by the upper triangular part of the updated distributed matrix. With uplo = 'L' or 'l', the lower triangular part of sub(C) is overwritten by the lower triangular part of the updated distributed matrix.

p?tran
Transposes a real distributed matrix.

Syntax
call pstran(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pdtran(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)

Include Files
• C: mkl_pblas.h

Description
The p?tran routines transpose a real distributed matrix. The operation is defined as
sub(C) := beta*sub(C) + alpha*sub(A)',
where:
alpha and beta are scalars,
sub(C) is an m-by-n distributed matrix, sub(C) = C(ic:ic+m-1, jc:jc+n-1),
sub(A) is an n-by-m distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+m-1).

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(C), n ≥ 0.
alpha (global) REAL for pstran, DOUBLE PRECISION for pdtran. Specifies the scalar alpha.
a (local) REAL for pstran, DOUBLE PRECISION for pdtran. Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) REAL for pstran, DOUBLE PRECISION for pdtran. Specifies the scalar beta. When beta is equal to zero, sub(C) need not be set on input.
c (local) REAL for pstran, DOUBLE PRECISION for pdtran. Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.

Output Parameters
c Overwritten by the updated submatrix.

p?tranu
Transposes a distributed complex matrix.

Syntax
call pctranu(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pztranu(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)

Include Files
• C: mkl_pblas.h

Description
The p?tranu routines transpose a complex distributed matrix (without conjugation). The operation is defined as
sub(C) := beta*sub(C) + alpha*sub(A)',
where:
alpha and beta are scalars,
sub(C) is an m-by-n distributed matrix, sub(C) = C(ic:ic+m-1, jc:jc+n-1),
sub(A) is an n-by-m distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+m-1).

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(C), n ≥ 0.
alpha (global) COMPLEX for pctranu, DOUBLE COMPLEX for pztranu. Specifies the scalar alpha.
a (local) COMPLEX for pctranu, DOUBLE COMPLEX for pztranu. Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) COMPLEX for pctranu, DOUBLE COMPLEX for pztranu. Specifies the scalar beta. When beta is equal to zero, sub(C) need not be set on input.
c (local) COMPLEX for pctranu, DOUBLE COMPLEX for pztranu. Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.

Output Parameters
c Overwritten by the updated submatrix.

p?tranc
Transposes a complex distributed matrix, conjugated.

Syntax
call pctranc(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)
call pztranc(m, n, alpha, a, ia, ja, desca, beta, c, ic, jc, descc)

Include Files
• C: mkl_pblas.h

Description
The p?tranc routines conjugate-transpose a complex distributed matrix. The operation is defined as
sub(C) := beta*sub(C) + alpha*conjg(sub(A)'),
where:
alpha and beta are scalars,
sub(C) is an m-by-n distributed matrix, sub(C) = C(ic:ic+m-1, jc:jc+n-1),
sub(A) is an n-by-m distributed matrix, sub(A) = A(ia:ia+n-1, ja:ja+m-1).

Input Parameters
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(C), m ≥ 0.
n (global) INTEGER.
Specifies the number of columns of the distributed matrix sub(C), n ≥ 0.
alpha (global) COMPLEX for pctranc, DOUBLE COMPLEX for pztranc. Specifies the scalar alpha.
a (local) COMPLEX for pctranc, DOUBLE COMPLEX for pztranc. Array, DIMENSION (lld_a, LOCq(ja+m-1)). This array contains the local pieces of the distributed matrix sub(A).
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
beta (global) COMPLEX for pctranc, DOUBLE COMPLEX for pztranc. Specifies the scalar beta. When beta is equal to zero, sub(C) need not be set on input.
c (local) COMPLEX for pctranc, DOUBLE COMPLEX for pztranc. Array, DIMENSION (lld_c, LOCq(jc+n-1)). This array contains the local pieces of the distributed matrix sub(C).
ic, jc (global) INTEGER. The row and column indices in the distributed matrix C indicating the first row and the first column of the submatrix sub(C), respectively.
descc (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix C.

Output Parameters
c Overwritten by the updated submatrix.

p?trmm
Computes a scalar-matrix-matrix product (one matrix operand is triangular) for distributed matrices.

Syntax
call pstrmm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pdtrmm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pctrmm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pztrmm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)

Include Files
• C: mkl_pblas.h

Description
The p?trmm routines perform a matrix-matrix operation using triangular matrices.
The operation is defined as
sub(B) := alpha*op(sub(A))*sub(B)
or
sub(B) := alpha*sub(B)*op(sub(A)),
where:
alpha is a scalar,
sub(B) is an m-by-n distributed matrix, sub(B) = B(ib:ib+m-1, jb:jb+n-1),
sub(A) is a unit or non-unit, upper or lower triangular distributed matrix, sub(A) = A(ia:ia+m-1, ja:ja+m-1) if side = 'L' or 'l', and sub(A) = A(ia:ia+n-1, ja:ja+n-1) if side = 'R' or 'r',
op(sub(A)) is one of op(sub(A)) = sub(A), op(sub(A)) = sub(A)', or op(sub(A)) = conjg(sub(A)').

Input Parameters
side (global) CHARACTER*1. Specifies whether op(sub(A)) appears on the left or right of sub(B) in the operation: if side = 'L' or 'l', then sub(B) := alpha*op(sub(A))*sub(B); if side = 'R' or 'r', then sub(B) := alpha*sub(B)*op(sub(A)).
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
transa (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix multiplication: if transa = 'N' or 'n', then op(sub(A)) = sub(A); if transa = 'T' or 't', then op(sub(A)) = sub(A)'; if transa = 'C' or 'c', then op(sub(A)) = conjg(sub(A)').
diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
m (global) INTEGER. Specifies the number of rows of the distributed matrix sub(B), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(B), n ≥ 0.
alpha (global) REAL for pstrmm, DOUBLE PRECISION for pdtrmm, COMPLEX for pctrmm, DOUBLE COMPLEX for pztrmm. Specifies the scalar alpha. When alpha is zero, the array b need not be set before entry.
a (local) REAL for pstrmm, DOUBLE PRECISION for pdtrmm, COMPLEX for pctrmm, DOUBLE COMPLEX for pztrmm. Array, DIMENSION (lld_a, ka), where ka is at least LOCq(1, ja+m-1) when side = 'L' or 'l' and is at least LOCq(1, ja+n-1) when side = 'R' or 'r'. Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the strictly lower triangular part of sub(A) are not referenced. Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the strictly upper triangular part of sub(A) are not referenced. When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) REAL for pstrmm, DOUBLE PRECISION for pdtrmm, COMPLEX for pctrmm, DOUBLE COMPLEX for pztrmm. Array, DIMENSION (lld_b, LOCq(1, jb+n-1)). Before entry, this array contains the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER. The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.

Output Parameters
b Overwritten by the transformed distributed matrix.

p?trsm
Solves a distributed matrix equation (one matrix operand is triangular).
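
As a serial point of reference for the kind of equation p?trsm solves, the sketch below handles a single case on one process: op(A)*X = alpha*B with an upper triangular, non-unit A that is not transposed (the case the parameter list below calls side = 'L', uplo = 'U', transa = 'N', diag = 'N'). It is not the MKL implementation: the function name ref_trsm_lunn is hypothetical, the arrays are ordinary column-major arrays rather than distributed ones, and array descriptors and process grids are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reference (hypothetical helper, not an MKL routine):
   solve A*X = alpha*B, where A is m-by-m upper triangular with a
   non-unit diagonal and B is m-by-n. Arrays are column-major with
   leading dimensions lda and ldb. B is overwritten by the solution X,
   and the strictly lower triangular part of A is never read. */
void ref_trsm_lunn(int m, int n, double alpha,
                   const double *a, int lda,
                   double *b, int ldb)
{
    assert(m >= 0 && n >= 0);
    assert(lda >= (m > 0 ? m : 1) && ldb >= (m > 0 ? m : 1));

    for (int j = 0; j < n; ++j) {
        double *col = b + (size_t)j * (size_t)ldb;

        /* Form the right-hand side alpha*B for this column. */
        for (int i = 0; i < m; ++i)
            col[i] *= alpha;

        /* Back substitution, from the last row up to the first. */
        for (int i = m - 1; i >= 0; --i) {
            double s = col[i];
            for (int k = i + 1; k < m; ++k)
                s -= a[i + (size_t)k * (size_t)lda] * col[k];
            col[i] = s / a[i + (size_t)i * (size_t)lda];
        }
    }
}
```

The in-place update of b mirrors the documented behavior that sub(B) is overwritten by the solution matrix X; the other side, uplo, transa, and diag combinations follow the same pattern with the loops reordered.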

Syntax
call pstrsm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pdtrsm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pctrsm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)
call pztrsm(side, uplo, transa, diag, m, n, alpha, a, ia, ja, desca, b, ib, jb, descb)

Include Files
• C: mkl_pblas.h

Description
The p?trsm routines solve one of the following distributed matrix equations:
op(sub(A))*X = alpha*sub(B),
or
X*op(sub(A)) = alpha*sub(B),
where:
alpha is a scalar,
X and sub(B) are m-by-n distributed matrices, sub(B) = B(ib:ib+m-1, jb:jb+n-1),
sub(A) is a unit or non-unit, upper or lower triangular distributed matrix, sub(A) = A(ia:ia+m-1, ja:ja+m-1) if side = 'L' or 'l', and sub(A) = A(ia:ia+n-1, ja:ja+n-1) if side = 'R' or 'r',
op(sub(A)) is one of op(sub(A)) = sub(A), op(sub(A)) = sub(A)', or op(sub(A)) = conjg(sub(A)').
The distributed matrix sub(B) is overwritten by the solution matrix X.

Input Parameters
side (global) CHARACTER*1. Specifies whether op(sub(A)) appears on the left or right of X in the equation: if side = 'L' or 'l', then op(sub(A))*X = alpha*sub(B); if side = 'R' or 'r', then X*op(sub(A)) = alpha*sub(B).
uplo (global) CHARACTER*1. Specifies whether the distributed matrix sub(A) is upper or lower triangular: if uplo = 'U' or 'u', then the matrix is upper triangular; if uplo = 'L' or 'l', then the matrix is lower triangular.
transa (global) CHARACTER*1. Specifies the form of op(sub(A)) used in the matrix equation: if transa = 'N' or 'n', then op(sub(A)) = sub(A); if transa = 'T' or 't', then op(sub(A)) = sub(A)'; if transa = 'C' or 'c', then op(sub(A)) = conjg(sub(A)').
diag (global) CHARACTER*1. Specifies whether the matrix sub(A) is unit triangular: if diag = 'U' or 'u', then the matrix is unit triangular; if diag = 'N' or 'n', then the matrix is not unit triangular.
m (global) INTEGER.
Specifies the number of rows of the distributed matrix sub(B), m ≥ 0.
n (global) INTEGER. Specifies the number of columns of the distributed matrix sub(B), n ≥ 0.
alpha (global) REAL for pstrsm, DOUBLE PRECISION for pdtrsm, COMPLEX for pctrsm, DOUBLE COMPLEX for pztrsm. Specifies the scalar alpha. When alpha is zero, a is not referenced and b need not be set before entry.
a (local) REAL for pstrsm, DOUBLE PRECISION for pdtrsm, COMPLEX for pctrsm, DOUBLE COMPLEX for pztrsm. Array, DIMENSION (lld_a, ka), where ka is at least LOCq(1, ja+m-1) when side = 'L' or 'l' and is at least LOCq(1, ja+n-1) when side = 'R' or 'r'. Before entry with uplo = 'U' or 'u', this array contains the local entries corresponding to the entries of the upper triangular distributed matrix sub(A), and the local entries corresponding to the strictly lower triangular part of sub(A) are not referenced. Before entry with uplo = 'L' or 'l', this array contains the local entries corresponding to the entries of the lower triangular distributed matrix sub(A), and the local entries corresponding to the strictly upper triangular part of sub(A) are not referenced. When diag = 'U' or 'u', the local entries corresponding to the diagonal elements of the submatrix sub(A) are not referenced either, but are assumed to be unity.
ia, ja (global) INTEGER. The row and column indices in the distributed matrix A indicating the first row and the first column of the submatrix sub(A), respectively.
desca (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix A.
b (local) REAL for pstrsm, DOUBLE PRECISION for pdtrsm, COMPLEX for pctrsm, DOUBLE COMPLEX for pztrsm. Array, DIMENSION (lld_b, LOCq(1, jb+n-1)). Before entry, this array contains the local pieces of the distributed matrix sub(B).
ib, jb (global) INTEGER.
The row and column indices in the distributed matrix B indicating the first row and the first column of the submatrix sub(B), respectively.
descb (global and local) INTEGER array of dimension 8. The array descriptor of the distributed matrix B.

Output Parameters
b Overwritten by the solution distributed matrix X.

Partial Differential Equations Support

The Intel® Math Kernel Library (Intel® MKL) provides tools for solving Partial Differential Equations (PDE). These tools are the Trigonometric Transform interface routines (see Trigonometric Transform Routines) and the Poisson Library (see Poisson Library Routines).

The Poisson Library is designed for fast solution of simple Helmholtz, Poisson, and Laplace problems. The solver is based on the Trigonometric Transform interface, which is, in turn, based on the Intel MKL Fast Fourier Transform (FFT) interface (refer to Fourier Transform Functions), optimized for Intel® processors.

Direct use of the Trigonometric Transform routines may be helpful if you have already implemented your own solver similar to the one that the Poisson Library provides. Because it may be hard to modify your original code to work with the Poisson Library, you are encouraged to use the fast (staggered) sine/cosine transforms implemented in the Trigonometric Transform interface to improve the performance of your solver.

Both Trigonometric Transform and Poisson Library routines can be called from C and Fortran 90, although the interface descriptions use the C convention. Fortran 90 users can find the specifics of routine calls in the "Calling PDE Support Routines from Fortran 90" section.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Trigonometric Transform Routines

In addition to the Fast Fourier Transform (FFT) interface, described in chapter "Fast Fourier Transforms", Intel® MKL supports the Real Discrete Trigonometric Transforms (sometimes called real-to-real Discrete Fourier Transforms) interface. In this manual, the interface is referred to as the TT interface. It implements a group of routines (TT routines) used to compute sine/cosine, staggered sine/cosine, and twice staggered sine/cosine transforms (referred to as staggered2 sine/cosine transforms, for brevity). The TT interface is flexible: you can tune routine parameters manually to fit your particular needs, or simply call the routines with default parameter values. The current Intel MKL implementation of the TT interface can be used in solving partial differential equations and contains routines that are helpful for Fast Poisson and similar solvers.

To describe the Intel MKL TT interface, the C convention is used. Fortran users should refer to Calling PDE Support Routines from Fortran 90.

For the list of Trigonometric Transforms currently implemented in the Intel MKL TT interface, see Transforms Implemented.
If you are used to the FFTW interface (www.fftw.org), you can call the TT interface functions through real-to-real FFTW to Intel MKL wrappers without changing FFTW function calls in your code (refer to the "FFTW to Intel® MKL Wrappers for FFTW 3.x" section in Appendix F for details). However, you are strongly encouraged to use the native TT interface for better performance. Another reason to use the wrappers cautiously is that the TT and the real-to-real FFTW interfaces are not fully compatible, and some features of the real-to-real FFTW, such as strides and multidimensional transforms, are not available through wrappers.

Transforms Implemented

TT routines allow computing the following transforms:
Forward sine transform
Backward sine transform
Forward staggered sine transform
Backward staggered sine transform
Forward staggered2 sine transform
Backward staggered2 sine transform
Forward cosine transform
Backward cosine transform
Forward staggered cosine transform
Backward staggered cosine transform
Forward staggered2 cosine transform
Backward staggered2 cosine transform

NOTE The size of the transform n can be any integer greater than or equal to 2.

Sequence of Invoking TT Routines

Computation of a transform using the TT interface is conceptually divided into four steps, each of which is performed via a dedicated routine. Table "TT Interface Routines" lists the routines and briefly describes their purpose and use.

Most TT routines have versions operating with single-precision and double-precision data. Names of such routines begin with "s" and "d", respectively. The wildcard "?" stands for either of these symbols in routine names.

TT Interface Routines
Routine: Description
?_init_trig_transform: Initializes basic data structures of Trigonometric Transforms.
?_commit_trig_transform: Checks consistency and correctness of user-defined data and creates a data structure to be used by the Intel MKL FFT interface (1).
?_forward_trig_transform, ?_backward_trig_transform: Computes a forward/backward Trigonometric Transform of a specified type using the appropriate formula (see Transforms Implemented).
free_trig_transform: Cleans the memory used by a data structure needed for calling the FFT interface (1).
(1) TT routines call the Intel MKL FFT interface for better performance.

To find a transformed vector for a particular input vector only once, the Intel MKL TT interface routines are normally invoked in the order in which they are listed in Table "TT Interface Routines".

NOTE Though the order of invoking TT routines may be changed, it is highly recommended to follow the above order of routine calls.

The diagram in Figure "Typical Order of Invoking TT Interface Routines" indicates the typical order in which TT interface routines can be invoked in a general case (prefixes and suffixes in routine names are omitted).

[Figure: Typical Order of Invoking TT Interface Routines]

A general scheme of using TT routines for double-precision computations is shown below. A similar scheme holds for single-precision computations, with the only difference being the initial letter of routine names.

    ...
    d_init_trig_transform(&n, &tt_type, ipar, dpar, &ir);
    /* Change parameters in ipar if necessary. */
    /* Note that the result of the Transform will be in f!
       If you want to preserve the data stored in f,
       save them before this place in your code */
    d_commit_trig_transform(f, &handle, ipar, dpar, &ir);
    d_forward_trig_transform(f, &handle, ipar, dpar, &ir);
    d_backward_trig_transform(f, &handle, ipar, dpar, &ir);
    free_trig_transform(&handle, ipar, &ir);
    /* here the user may clean the memory used by f, dpar, ipar */
    ...

You can find examples of Fortran 90 and C code that use TT interface routines to solve a one-dimensional Helmholtz problem in the examples\pdettf\source and examples\pdettc\source folders of your Intel MKL directory.
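
Because the transform result is written into f, code that still needs the original samples has to copy them before the transform routines run. A minimal sketch of such a copy follows; the helper name tt_backup_input is hypothetical and is not part of the TT interface, and the caller is assumed to know the length of f.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an MKL routine): returns a heap copy of the
   first len elements of f, or NULL if the allocation fails.
   The caller releases the copy with free(). */
double *tt_backup_input(const double *f, size_t len)
{
    assert(f != NULL);

    double *copy = malloc(len * sizeof *copy);
    if (copy != NULL)
        memcpy(copy, f, len * sizeof *copy);
    return copy;
}
```

In the scheme above, such a call would sit before d_commit_trig_transform, and the copy would be released once the original data is no longer needed.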

Interface Description

All types in this documentation are standard C types: int, float, and double. Fortran 90 users can call the routines with INTEGER, REAL, and DOUBLE PRECISION Fortran types, respectively (see examples in the examples\pdettf\source and examples\pdettc\source folders of your Intel MKL directory).

The interface description uses the built-in type int for integer values. If you employ the ILP64 interface, read this type as long long int (or INTEGER*8 for Fortran). For more information, refer to the Intel MKL User's Guide.

Routine Options
All TT routines use parameters to pass various options to one another. These parameters are the arrays ipar, dpar, and spar. Values for these parameters should be specified very carefully (see Common Parameters). You can change these values during computations to meet your needs.

WARNING To avoid failure or wrong results, you must provide correct and consistent parameters to the routines.

User Data Arrays
TT routines take arrays of user data as input. For example, user arrays are passed to the routine d_forward_trig_transform to compute a forward Trigonometric Transform. To minimize storage requirements and improve the overall run-time efficiency, Intel MKL TT routines do not make copies of user input arrays.

NOTE If you need a copy of your input data arrays, save them yourself.

TT Routines

This section gives a detailed description of TT routines, their syntax, parameters, and the values they return. Double-precision and single-precision versions of the same routine are described together.

TT routines call the Intel MKL FFT interface (described in section "FFT Functions" in chapter "Fast Fourier Transforms"), which enhances performance of the routines.

?_init_trig_transform
Initializes basic data structures of a Trigonometric Transform.

Syntax
void d_init_trig_transform(int *n, int *tt_type, int ipar[], double dpar[], int *stat);
void s_init_trig_transform(int *n, int *tt_type, int ipar[], float spar[], int *stat);

Include Files
• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h

Input Parameters
n int*. Contains the size of the problem, which should be a positive integer greater than 1. Note that the data vector of the transform, which other TT routines will use, must have size n+1 for all but staggered2 transforms. Staggered2 transforms require a vector of size n.
tt_type int*. Contains the type of transform to compute, defined via a set of named constants. The following constants are available in the current implementation of the TT interface: MKL_SINE_TRANSFORM, MKL_STAGGERED_SINE_TRANSFORM, MKL_STAGGERED2_SINE_TRANSFORM; MKL_COSINE_TRANSFORM, MKL_STAGGERED_COSINE_TRANSFORM, MKL_STAGGERED2_COSINE_TRANSFORM.

Output Parameters
ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.
dpar double array of size 5n/2+2. Contains double-precision data needed for Trigonometric Transform computations.
spar float array of size 5n/2+2. Contains single-precision data needed for Trigonometric Transform computations.
stat int*. Contains the routine completion status, which is also written to ipar[6]. The status should be 0 to proceed to other TT routines.

Description
The ?_init_trig_transform routine initializes basic data structures for Trigonometric Transforms of appropriate precision. After a call to ?_init_trig_transform, all subsequently invoked TT routines use values of the ipar and dpar (spar) array parameters returned by ?_init_trig_transform. The routine initializes the entire array ipar. In the dpar or spar array, ?_init_trig_transform initializes elements that do not depend upon the type of transform. For a detailed description of arrays ipar, dpar and spar, refer to the Common Parameters section.
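
The array sizes stated in the parameter descriptions can be collected into small helpers, which is convenient when allocating f, ipar, and dpar (or spar) before the first TT call. The sketch below is only an illustration: the enum tt_family and the function names are hypothetical and are not the MKL_*_TRANSFORM constants, and rounding 5n/2 up for odd n is a conservative assumption rather than something the text above specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical local tags; these are NOT the MKL_*_TRANSFORM constants. */
enum tt_family { TT_PLAIN, TT_STAGGERED, TT_STAGGERED2 };

/* ipar is documented as an int array of size 128. */
enum { TT_IPAR_LEN = 128 };

/* Length of the data vector f: n for staggered2 transforms,
   n+1 for all other transforms. */
size_t tt_data_len(int n, enum tt_family family)
{
    assert(n > 1);
    return (family == TT_STAGGERED2) ? (size_t)n : (size_t)n + 1;
}

/* Length of dpar or spar: 5n/2+2. For odd n the half is rounded up,
   which is an assumption made here to stay on the safe side. */
size_t tt_par_len(int n)
{
    assert(n > 1);
    return (5 * (size_t)n + 1) / 2 + 2;
}
```

For n = 8, for example, these helpers give a data vector of 9 elements (8 for staggered2) and a dpar or spar array of 22 elements.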
You can skip calling the initialization routine in your code. For more information, see Caveat on Parameter Modifications.

Return Values
stat= 0 The routine successfully completed the task. In general, to proceed with computations, the routine should complete with this stat value.
stat= -99999 The routine failed to complete the task.

?_commit_trig_transform
Checks consistency and correctness of user's data and initializes certain data structures required to perform the Trigonometric Transform.

Syntax
void d_commit_trig_transform(double f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], double dpar[], int *stat);
void s_commit_trig_transform(float f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], float spar[], int *stat);

Include Files
• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h

Input Parameters
f double for d_commit_trig_transform, float for s_commit_trig_transform. Array of size n for staggered2 transforms and of size n+1 for all other transforms, where n is the size of the problem. Contains the data vector to be transformed. Note that the following values should be 0.0 up to rounding errors:
• f[0] and f[n] for sine transforms
• f[n] for staggered cosine transforms
• f[0] for staggered sine transforms.
Otherwise, the routine will produce a warning, and the result of the computations for sine transforms may be wrong. These restrictions meet the requirements of the Poisson Library (described in the Poisson Library Routines section), which the TT interface is primarily designed for.
ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.
dpar double array of size 5n/2+2. Contains double-precision data needed for Trigonometric Transform computations. The routine initializes most elements of this array.
spar float array of size 5n/2+2. Contains single-precision data needed for Trigonometric Transform computations.
The routine initializes most elements of this array.

Output Parameters
handle DFTI_DESCRIPTOR_HANDLE*. The data structure used by Intel MKL FFT interface (for details, refer to section "FFT Functions" in chapter "Fast Fourier Transforms").
ipar Contains integer data needed for Trigonometric Transform computations. On output, ipar[6] is updated with the stat value.
dpar Contains double-precision data needed for Trigonometric Transform computations. On output, the entire array is initialized.
spar Contains single-precision data needed for Trigonometric Transform computations. On output, the entire array is initialized.
stat int*. Contains the routine completion status, which is also written to ipar[6].

Description
The routine ?_commit_trig_transform checks consistency and correctness of the parameters to be passed to the transform routines ?_forward_trig_transform and/or ?_backward_trig_transform. The routine also initializes the following data structures: handle, dpar in case of d_commit_trig_transform, and spar in case of s_commit_trig_transform. The ?_commit_trig_transform routine initializes only those elements of dpar or spar that depend upon the type of transform, defined in the ?_init_trig_transform routine and passed to ?_commit_trig_transform with the ipar array. The size of the problem n, which determines sizes of the array parameters, is also passed to the routine with the ipar array and defined in the previously called ?_init_trig_transform routine. For a detailed description of arrays ipar, dpar and spar, refer to the Common Parameters section.

The routine performs only a basic check for correctness and consistency of the parameters. If you are going to modify parameters of TT routines, see the Caveat on Parameter Modifications section. Unlike ?_init_trig_transform, the ?_commit_trig_transform routine is mandatory, and you cannot skip calling it in your code.
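The array sizes and the zero-value requirements on f stated above can be checked on the caller's side before the commit step. The following is a minimal sketch, not part of Intel MKL: the tt_kind enum and the helper names are hypothetical stand-ins, the tolerance is a plain caller-supplied threshold, and rounding 5n/2 up for odd n is an assumption, since the text does not say how the quotient is rounded.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the MKL_*_TRANSFORM constants; the real
   constants come from mkl_trig_transforms.h and are not assumed here. */
typedef enum {
    TT_SINE, TT_STAGGERED_SINE, TT_STAGGERED2_SINE,
    TT_COSINE, TT_STAGGERED_COSINE, TT_STAGGERED2_COSINE
} tt_kind;

/* Length of the data vector f: n for staggered2 transforms, n+1 otherwise. */
size_t tt_f_len(int n, tt_kind kind) {
    int staggered2 = (kind == TT_STAGGERED2_SINE || kind == TT_STAGGERED2_COSINE);
    return (size_t)(staggered2 ? n : n + 1);
}

/* Length of dpar or spar: 5n/2+2, rounded up for odd n (assumption). */
size_t tt_par_len(int n) {
    return (size_t)((5 * n + 1) / 2 + 2);
}

/* Absolute value without pulling in the math library. */
static double tt_abs(double v) {
    return v < 0.0 ? -v : v;
}

/* Returns 1 if the entries of f that must be 0.0 are within tol in
   absolute value for the given transform kind, and 0 otherwise. */
int tt_boundary_ok(const double *f, int n, tt_kind kind, double tol) {
    switch (kind) {
    case TT_SINE:             return tt_abs(f[0]) <= tol && tt_abs(f[n]) <= tol;
    case TT_STAGGERED_SINE:   return tt_abs(f[0]) <= tol;
    case TT_STAGGERED_COSINE: return tt_abs(f[n]) <= tol;
    default:                  return 1; /* no zero-value requirement listed */
    }
}
```

A caller could use tt_f_len and tt_par_len to size the allocations and tt_boundary_ok as an early sanity check; the commit routine still performs its own checks.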
Return Values
stat= 11 The routine produced some warnings and made some changes in the parameters to achieve their correctness and/or consistency. You may proceed with computations by assigning ipar[6]=0 if you are sure that the parameters are correct.
stat= 10 The routine made some changes in the parameters to achieve their correctness and/or consistency. You may proceed with computations by assigning ipar[6]=0 if you are sure that the parameters are correct.
stat= 1 The routine produced some warnings. You may proceed with computations by assigning ipar[6]=0 if you are sure that the parameters are correct.
stat= 0 The routine completed the task normally.
stat= -100 The routine stopped for any of the following reasons:
• An error in the user's data was encountered.
• Data in ipar, dpar or spar parameters became incorrect and/or inconsistent as a result of modifications.
stat= -1000 The routine stopped because of an FFT interface error.
stat= -10000 The routine stopped because the initialization failed to complete or the parameter ipar[0] was altered by mistake.

NOTE Although positive values of stat usually indicate minor problems with the input data and Trigonometric Transform computations can be continued, it is highly recommended that you investigate the problem first and achieve stat=0.

?_forward_trig_transform
Computes the forward Trigonometric Transform of type specified by the parameter.

Syntax
void d_forward_trig_transform(double f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], double dpar[], int *stat);
void s_forward_trig_transform(float f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], float spar[], int *stat);

Include Files
• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h

Input Parameters
f double for d_forward_trig_transform, float for s_forward_trig_transform, array of size n for staggered2 transforms and of size n+1 for all other transforms, where n is the size of the problem.
On input, contains data vector to be transformed. Note that the following values should be 0.0 up to rounding errors:
• f[0] and f[n] for sine transforms
• f[n] for staggered cosine transforms
• f[0] for staggered sine transforms.
Otherwise, the routine will produce a warning, and the result of the computations for sine transforms may be wrong. The above restrictions meet the requirements of the Poisson Library (described in the Poisson Library Routines section), which the TT interface is primarily designed for.
handle DFTI_DESCRIPTOR_HANDLE*. The data structure used by Intel MKL FFT interface (for details, refer to section "FFT Functions" in chapter "Fast Fourier Transforms").
ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.
dpar double array of size 5n/2+2. Contains double-precision data needed for Trigonometric Transform computations.
spar float array of size 5n/2+2. Contains single-precision data needed for Trigonometric Transform computations.

Output Parameters
f Contains the transformed vector on output.
ipar Contains integer data needed for Trigonometric Transform computations. On output, ipar[6] is updated with the stat value.
stat int*. Contains the routine completion status, which is also written to ipar[6].

Description
The routine computes the forward Trigonometric Transform of type defined in the ?_init_trig_transform routine and passed to ?_forward_trig_transform with the ipar array. The size of the problem n, which determines sizes of the array parameters, is also passed to the routine with the ipar array and defined in the previously called ?_init_trig_transform routine. The other data that facilitates the computation is created by ?_commit_trig_transform and supplied in dpar or spar. For a detailed description of arrays ipar, dpar and spar, refer to the Common Parameters section. The routine has a commit step, which calls the ?_commit_trig_transform routine.
The transform is computed according to formulas given in the Transforms Implemented section. The routine replaces the input vector f with the transformed vector.

NOTE If you need a copy of the data vector f to be transformed, make the copy before calling the ?_forward_trig_transform routine.

Return Values
stat= 0 The routine completed the task normally.
stat= -100 The routine stopped for any of the following reasons:
• An error in the user's data was encountered.
• Data in ipar, dpar or spar parameters became incorrect and/or inconsistent as a result of modifications.
stat= -1000 The routine stopped because of an FFT interface error.
stat= -10000 The routine stopped because its commit step failed to complete or the parameter ipar[0] was altered by mistake.

?_backward_trig_transform
Computes the backward Trigonometric Transform of type specified by the parameter.

Syntax
void d_backward_trig_transform(double f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], double dpar[], int *stat);
void s_backward_trig_transform(float f[], DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], float spar[], int *stat);

Include Files
• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h

Input Parameters
f double for d_backward_trig_transform, float for s_backward_trig_transform, array of size n for staggered2 transforms and of size n+1 for all other transforms, where n is the size of the problem. On input, contains data vector to be transformed. Note that the following values should be 0.0 up to rounding errors:
• f[0] and f[n] for sine transforms
• f[n] for staggered cosine transforms
• f[0] for staggered sine transforms.
Otherwise, the routine will produce a warning, and the result of the computations for sine transforms may be wrong. The above restrictions meet the requirements of the Poisson Library (described in the Poisson Library Routines section), which the TT interface is primarily designed for.
handle DFTI_DESCRIPTOR_HANDLE*.
The data structure used by Intel MKL FFT interface (for details, refer to section "FFT Functions" in chapter "Fast Fourier Transforms").
ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.
dpar double array of size 5n/2+2. Contains double-precision data needed for Trigonometric Transform computations.
spar float array of size 5n/2+2. Contains single-precision data needed for Trigonometric Transform computations.

Output Parameters
f Contains the transformed vector on output.
ipar Contains integer data needed for Trigonometric Transform computations. On output, ipar[6] is updated with the stat value.
stat int*. Contains the routine completion status, which is also written to ipar[6].

Description
The routine computes the backward Trigonometric Transform of type defined in the ?_init_trig_transform routine and passed to ?_backward_trig_transform with the ipar array. The size of the problem n, which determines sizes of the array parameters, is also passed to the routine with the ipar array and defined in the previously called ?_init_trig_transform routine. The other data that facilitates the computation is created by ?_commit_trig_transform and supplied in dpar or spar. For a detailed description of arrays ipar, dpar and spar, refer to the Common Parameters section. The routine has a commit step, which calls the ?_commit_trig_transform routine. The transform is computed according to formulas given in the Transforms Implemented section. The routine replaces the input vector f with the transformed vector.

NOTE If you need a copy of the data vector f to be transformed, make the copy before calling the ?_backward_trig_transform routine.

Return Values
stat= 0 The routine completed the task normally.
stat= -100 The routine stopped for any of the following reasons:
• An error in the user's data was encountered.
• Data in ipar, dpar or spar parameters became incorrect and/or inconsistent as a result of modifications.
stat= -1000 The routine stopped because of an FFT interface error.
stat= -10000 The routine stopped because its commit step failed to complete or the parameter ipar[0] was altered by mistake.

free_trig_transform
Releases the memory allocated for the data structure used by the FFT interface.

Syntax
void free_trig_transform(DFTI_DESCRIPTOR_HANDLE *handle, int ipar[], int *stat);

Include Files
• FORTRAN 90: mkl_trig_transforms.f90
• C: mkl_trig_transforms.h

Input Parameters
ipar int array of size 128. Contains integer data needed for Trigonometric Transform computations.
handle DFTI_DESCRIPTOR_HANDLE*. The data structure used by Intel MKL FFT interface (for details, refer to section "FFT Functions" in chapter "Fast Fourier Transforms").

Output Parameters
handle The data structure used by Intel MKL FFT interface. Memory allocated for the structure is released on output.
ipar Contains integer data needed for Trigonometric Transform computations. On output, ipar[6] is updated with the stat value.
stat int*. Contains the routine completion status, which is also written to ipar[6].

Description
The free_trig_transform routine releases the memory used by the handle structure, needed for Intel MKL FFT functions. To release the memory allocated for other parameters, free that memory in your own code.

Return Values
stat= 0 The routine completed the task normally.
stat= -1000 The routine stopped because of an FFT interface error.
stat= -99999 The routine failed to complete the task.

Common Parameters
This section describes the array parameters that hold TT routine options: ipar, dpar and spar.

NOTE Initial values are assigned to the array parameters by the appropriate ?_init_trig_transform and ?_commit_trig_transform routines.
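Because every TT routine reports its completion status both in stat and in ipar[6], it can help to collect the status values listed in the Return Values sections in one place. The sketch below is an illustration only: tt_stat_class and tt_stat_message are hypothetical helper names, not Intel MKL functions, and the message strings paraphrase the descriptions given above.

```c
#include <stddef.h>

/* Coarse classification of a TT status value:
    0  normal completion,
    1  positive stat (warnings and/or adjusted parameters),
   -1  negative stat (the routine stopped or failed). */
int tt_stat_class(int stat) {
    if (stat == 0) return 0;
    return stat > 0 ? 1 : -1;
}

/* Short description of the status values listed for the TT routines.
   Returns NULL for a value that those lists do not mention. */
const char *tt_stat_message(int stat) {
    switch (stat) {
    case 0:      return "completed normally";
    case 1:      return "warnings produced";
    case 10:     return "parameters changed for correctness/consistency";
    case 11:     return "warnings produced and parameters changed";
    case -100:   return "error in user data or inconsistent ipar/dpar/spar";
    case -1000:  return "FFT interface error";
    case -10000: return "initialization or commit step failed, or ipar[0] altered";
    case -99999: return "routine failed to complete the task";
    default:     return NULL;
    }
}
```

A caller might stop on a negative class, and log the message and investigate on a positive one, in line with the recommendation to achieve stat=0 before continuing.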
ipar int array of size 128, holds integer data needed for Trigonometric Transform computations. Its elements are described in Table "Elements of the ipar Array":

Elements of the ipar Array
Index    Description

0        Contains the size of the problem to solve. The ?_init_trig_transform routine sets ipar[0]=n, and all subsequently called TT routines use ipar[0] as the size of the transform.

1        Contains error messaging options:
         • ipar[1]=-1 indicates that all error messages will be printed to the file MKL_Trig_Transforms_log.txt in the folder from which the routine is called. If the file does not exist, the routine tries to create it. If the attempt fails, the routine prints information that the file cannot be created to the standard output device.
         • ipar[1]=0 indicates that no error messages will be printed.
         • ipar[1]=1 (default) indicates that all error messages will be printed to the preconnected default output device (usually, screen).
         In case of errors, each TT routine assigns a non-zero value to stat regardless of the ipar[1] setting.

2        Contains warning messaging options:
         • ipar[2]=-1 indicates that all warning messages will be printed to the file MKL_Trig_Transforms_log.txt in the directory from which the routine is called. If the file does not exist, the routine tries to create it. If the attempt fails, the routine prints information that the file cannot be created to the standard output device.
         • ipar[2]=0 indicates that no warning messages will be printed.
         • ipar[2]=1 (default) indicates that all warning messages will be printed to the preconnected default output device (usually, screen).
         In case of warnings, the stat parameter will acquire a non-zero value regardless of the ipar[2] setting.

3 through 4    Reserved for future use.

5        Contains the type of the transform.
         The ?_init_trig_transform routine sets ipar[5]=tt_type, and all subsequently called TT routines use ipar[5] as the type of the transform.

6        Contains the stat value returned by the last completed TT routine. Used to check that the previous call to a TT routine completed with stat=0.

7        Informs the ?_commit_trig_transform routines whether to initialize data structures dpar (spar) and handle. ipar[7]=0 indicates that the routine should skip the initialization and only check correctness and consistency of the parameters. Otherwise, the routine initializes the data structures. The default value is 1. Checking the correctness and consistency of input data without initializing the data structures dpar, spar and handle avoids performance losses when the same transform is used repeatedly for different data vectors. Note that you can benefit from the opportunity that ipar[7] gives only if you are sure that you have supplied a proper tolerance value in the dpar or spar array. Otherwise, avoid tuning this parameter.

8        Contains message style options for TT routines. If ipar[8]=0 then TT routines print all error and warning messages in Fortran-style notations. Otherwise, TT routines print the messages in C-style notations. The default value is 1. When selecting between these notations, keep in mind that by default, numbering of elements in C arrays starts from 0 and in Fortran, it starts from 1. For example, for a C-style message "parameter ipar[0]=3 should be an even integer", the corresponding Fortran-style message will be "parameter ipar(1)=3 should be an even integer". The use of ipar[8] enables you to view messages in a more convenient style.

9        Specifies the number of OpenMP threads to run TT routines in the OpenMP environment of the Poisson Library. The default value is 1. It is highly recommended that you do not alter this value. See also Caveat on Parameter Modifications.

10       Specifies the mode of compatibility with FFTW. The default value is 0.
         Set the value to 1 to invoke compatibility with FFTW. In the latter case, results will not be normalized, because FFTW does not do this. It is highly recommended not to alter this value, but rather use real-to-real FFTW to MKL wrappers, described in the "FFTW to Intel® MKL Wrappers for FFTW 3.x" section in Appendix F. See also Caveat on Parameter Modifications.

11 through 127    Reserved for future use.

NOTE You may declare the ipar array in your code as int ipar[11]. However, for compatibility with later versions of Intel MKL TT interface, which may require more ipar values, it is highly recommended to declare ipar as int ipar[128].

Arrays dpar and spar are the same except in the data precision:
dpar double array of size 5n/2+2, holds data needed for double-precision routines to perform TT computations. This array is initialized in the d_init_trig_transform and d_commit_trig_transform routines.
spar float array of size 5n/2+2, holds data needed for single-precision routines to perform TT computations. This array is initialized in the s_init_trig_transform and s_commit_trig_transform routines.

As dpar and spar have similar elements in respective positions, the elements are described together in Table "Elements of the dpar and spar Arrays":

Elements of the dpar and spar Arrays
Index    Description

0        Contains the first absolute tolerance used by the appropriate ?_commit_trig_transform routine. For a staggered cosine or a sine transform, f[n] should be equal to 0.0 and for a staggered sine or a sine transform, f[0] should be equal to 0.0. The ?_commit_trig_transform routine checks whether absolute values of these parameters are below dpar[0]*n or spar[0]*n, depending on the routine precision. To suppress warnings resulting from tolerance checks, set dpar[0] or spar[0] to a sufficiently large number.

1        Reserved for future use.

2 through 5n/2+1    Contain tabulated values of trigonometric functions.
         Contents of the elements depend upon the type of transform tt_type, set up in the ?_commit_trig_transform routine:
         • If tt_type=MKL_SINE_TRANSFORM, the transform uses only the first n/2 array elements, which contain tabulated sine values.
         • If tt_type=MKL_STAGGERED_SINE_TRANSFORM, the transform uses only the first 3n/2 array elements, which contain tabulated sine and cosine values.
         • If tt_type=MKL_STAGGERED2_SINE_TRANSFORM, the transform uses all the 5n/2 array elements, which contain tabulated sine and cosine values.
         • If tt_type=MKL_COSINE_TRANSFORM, the transform uses only the first n array elements, which contain tabulated cosine values.
         • If tt_type=MKL_STAGGERED_COSINE_TRANSFORM, the transform uses only the first 3n/2 elements, which contain tabulated sine and cosine values.
         • If tt_type=MKL_STAGGERED2_COSINE_TRANSFORM, the transform uses all the 5n/2 elements, which contain tabulated sine and cosine values.

NOTE To save memory, you can define the array size depending upon the type of transform.

Caveat on Parameter Modifications
Flexibility of the TT interface enables you to skip calling the ?_init_trig_transform routine and to initialize the basic data structures explicitly in your code. You may also need to modify the contents of ipar, dpar and spar arrays after initialization. When doing so, provide correct and consistent data in the arrays. Mistakenly altered arrays cause errors or wrong computation. You can perform a basic check for correctness and consistency of parameters by calling the ?_commit_trig_transform routine; however, this does not ensure the correct result of a transform but only reduces the chance of errors or wrong results.

NOTE To supply correct and consistent parameters to TT routines, you should have considerable experience in using the TT interface and good understanding of elements that the ipar, spar and dpar arrays contain and dependencies between values of these elements.
However, in rare cases, even advanced users might fail to compute a transform using TT routines after the parameter modifications. In such cases, contact technical support at http://www.intel.com/software/products/support/ .

WARNING The only way to ensure proper computation of the Trigonometric Transforms is to follow a typical sequence of invoking the routines and not change the default set of parameters. So, avoid modifications of ipar, dpar and spar arrays unless a strong need arises.

Implementation Details
Several aspects of the Intel MKL TT interface are platform-specific and language-specific. To promote portability across platforms and ease of use across different languages, users are provided with the TT language-specific header files to include in their code. Currently, the following header files are available:
• mkl_trig_transforms.h, to be used together with mkl_dfti.h, for C programs.
• mkl_trig_transforms.f90, to be used together with mkl_dfti.f90, for Fortran 90 programs.

NOTE Use of the Intel MKL TT software without including one of the above header files is not supported.
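In the spirit of the recommendation in Caveat on Parameter Modifications to keep the default parameter set, a program can verify that ipar still holds the defaults listed in Table "Elements of the ipar Array" without touching the array. The following is a minimal sketch; the enum names and the function tt_ipar_nondefault_count are hypothetical and are not defined by the Intel MKL header files.

```c
/* Hypothetical names for the documented ipar positions (C indexing). */
enum {
    TT_IPAR_N = 0,        /* size of the problem */
    TT_IPAR_ERRORS = 1,   /* error messaging option, default 1 */
    TT_IPAR_WARNINGS = 2, /* warning messaging option, default 1 */
    TT_IPAR_TYPE = 5,     /* type of the transform */
    TT_IPAR_STAT = 6,     /* stat of the last completed TT routine */
    TT_IPAR_INIT = 7,     /* commit initialization flag, default 1 */
    TT_IPAR_STYLE = 8,    /* message style, default 1 (C-style) */
    TT_IPAR_THREADS = 9,  /* number of OpenMP threads, default 1 */
    TT_IPAR_FFTW = 10     /* FFTW compatibility mode, default 0 */
};

/* Counts the tunable entries that differ from the defaults stated in the
   ipar table. The array is only read, never modified. */
int tt_ipar_nondefault_count(const int ipar[]) {
    static const struct { int index; int value; } defaults[] = {
        { TT_IPAR_ERRORS, 1 }, { TT_IPAR_WARNINGS, 1 }, { TT_IPAR_INIT, 1 },
        { TT_IPAR_STYLE, 1 }, { TT_IPAR_THREADS, 1 }, { TT_IPAR_FFTW, 0 }
    };
    int count = 0;
    for (unsigned k = 0; k < sizeof defaults / sizeof defaults[0]; ++k) {
        if (ipar[defaults[k].index] != defaults[k].value) {
            ++count;
        }
    }
    return count;
}
```

A non-zero count does not mean the parameters are wrong; it only flags that the program has moved away from the documented defaults and that the caveats above apply.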
C-specific Header File
The C-specific header file defines the following function prototypes:

void d_init_trig_transform(int *, int *, int *, double *, int *);
void d_commit_trig_transform(double *, DFTI_DESCRIPTOR_HANDLE *, int *, double *, int *);
void d_forward_trig_transform(double *, DFTI_DESCRIPTOR_HANDLE *, int *, double *, int *);
void d_backward_trig_transform(double *, DFTI_DESCRIPTOR_HANDLE *, int *, double *, int *);
void s_init_trig_transform(int *, int *, int *, float *, int *);
void s_commit_trig_transform(float *, DFTI_DESCRIPTOR_HANDLE *, int *, float *, int *);
void s_forward_trig_transform(float *, DFTI_DESCRIPTOR_HANDLE *, int *, float *, int *);
void s_backward_trig_transform(float *, DFTI_DESCRIPTOR_HANDLE *, int *, float *, int *);
void free_trig_transform(DFTI_DESCRIPTOR_HANDLE *, int *, int *);

Fortran-Specific Header File
The Fortran90-specific header file defines the following function prototypes:

SUBROUTINE D_INIT_TRIG_TRANSFORM(n, tt_type, ipar, dpar, stat)
  INTEGER, INTENT(IN) :: n, tt_type
  INTEGER, INTENT(INOUT) :: ipar(*)
  REAL(8), INTENT(INOUT) :: dpar(*)
  INTEGER, INTENT(OUT) :: stat
END SUBROUTINE D_INIT_TRIG_TRANSFORM

SUBROUTINE D_COMMIT_TRIG_TRANSFORM(f, handle, ipar, dpar, stat)
  REAL(8), INTENT(INOUT) :: f(*)
  TYPE(DFTI_DESCRIPTOR), POINTER :: handle
  INTEGER, INTENT(INOUT) :: ipar(*)
  REAL(8), INTENT(INOUT) :: dpar(*)
  INTEGER, INTENT(OUT) :: stat
END SUBROUTINE D_COMMIT_TRIG_TRANSFORM

SUBROUTINE D_FORWARD_TRIG_TRANSFORM(f, handle, ipar, dpar, stat)
  REAL(8), INTENT(INOUT) :: f(*)
  TYPE(DFTI_DESCRIPTOR), POINTER :: handle
  INTEGER, INTENT(INOUT) :: ipar(*)
  REAL(8), INTENT(INOUT) :: dpar(*)
  INTEGER, INTENT(OUT) :: stat
END SUBROUTINE D_FORWARD_TRIG_TRANSFORM

SUBROUTINE D_BACKWARD_TRIG_TRANSFORM(f, handle, ipar, dpar, stat)
  REAL(8), INTENT(INOUT) :: f(*)
  TYPE(DFTI_DESCRIPTOR), POINTER :: handle
  INTEGER, INTENT(INOUT) :: ipar(*)
  REAL(8), INTENT(INOUT) :: dpar(*)
  INTEGER, INTENT(OUT) :: stat
END SUBROUTINE D_BACKWARD_TRIG_TRANSFORM

SUBROUTINE S_INIT_TRIG_TRANSFORM(n, tt_type, ipar, spar, stat)
  INTEGER, INTENT(IN) :: n, tt_type
  INTEGER, INTENT(INOUT) :: ipar(*)
  REAL(4), INTENT(INOUT) :: spar(*)
  INTEGER, INTENT(OUT) :: stat
END SUBROUTINE S_INIT_TRIG_TRANSFORM

SUBROUTINE S_COMMIT_TRIG_TRANSFORM(f, handle, ipar, spar, stat)
  REAL(4), INTENT(INOUT) :: f(*)
  TYPE(DFTI_DESCRIPTOR), POINTER :: handle
  INTEGER, INTENT(INOUT) :: ipar(*)

Fortran 90 specifics of the TT routines usage are similar for all Intel MKL PDE support tools and described in the Calling PDE Support Routines from Fortran 90 section.

Poisson Library Routines
In addition to the Real Discrete Trigonometric Transforms (TT) interface (refer to Trigonometric Transform Routines), Intel® MKL supports the Poisson Library interface, referred to as PL interface. The interface implements a group of routines (PL routines) used to compute a solution of Laplace, Poisson, and Helmholtz problems of a special kind using discrete Fourier transforms. Laplace and Poisson problems are special cases of a more general Helmholtz problem. The problems being solved are defined more exactly in the Poisson Library Implemented subsection.

The PL interface provides much flexibility of use: you can adjust routines to your particular needs at the cost of manually tuning routine parameters, or just call routines with default parameter values. The interface can adjust the style of error and warning messages to C or Fortran notations by setting up a dedicated parameter. This adds convenience to debugging, because users can read information in the way that is natural for their code. The Intel MKL PL interface currently contains only routines that implement the following solvers:
• Fast Laplace, Poisson and Helmholtz solvers in a Cartesian coordinate system
• Fast Poisson and Helmholtz solvers in a spherical coordinate system.

To describe the Intel MKL PL interface, the C convention is used.
Fortran usage specifics can be found in the Calling PDE Support Routines from Fortran 90 section.

NOTE Fortran users should keep in mind that respective array indices in Fortran increase by 1.

Poisson Library Implemented
PL routines enable approximate solving of certain two-dimensional and three-dimensional problems. Figure "Structure of the Poisson Library" shows the general structure of the Poisson Library.

[Figure: Structure of the Poisson Library]

Sections below provide details of the problems that can be solved using Intel MKL PL.

Two-Dimensional Problems

Notational Conventions
The PL interface description uses the following notation for boundaries of a rectangular domain ax < x < bx, ay < y < by on a Cartesian plane:
bd_ax = {x = ax, ay ≤ y ≤ by}, bd_bx = {x = bx, ay ≤ y ≤ by}
bd_ay = {ax ≤ x ≤ bx, y = ay}, bd_by = {ax ≤ x ≤ bx, y = by}.
The wildcard "+" may stand for any of the symbols ax, bx, ay, by, so that bd_+ denotes any of the above boundaries.

The PL interface description uses the following notation for boundaries of a rectangular domain aφ < φ < bφ, aθ < θ < bθ on a sphere 0 ≤ φ ≤ 2π, 0 ≤ θ ≤ π:
bd_aφ = {φ = aφ, aθ ≤ θ ≤ bθ}, bd_bφ = {φ = bφ, aθ ≤ θ ≤ bθ}
bd_aθ = {aφ ≤ φ ≤ bφ, θ = aθ}, bd_bθ = {aφ ≤ φ ≤ bφ, θ = bθ}.
The wildcard "~" may stand for any of the symbols aφ, bφ, aθ, bθ, so that bd_~ denotes any of the above boundaries.

Two-dimensional (2D) Helmholtz problem on a Cartesian plane
The 2D Helmholtz problem is to find an approximate solution of the Helmholtz equation in a rectangle, that is, a rectangular domain ax < x < bx, ay < y < by, with one of the following boundary conditions on each boundary bd_+:
• The Dirichlet boundary condition
• The Neumann boundary condition
where n = -x on bd_ax, n = x on bd_bx, n = -y on bd_ay, n = y on bd_by.

Two-dimensional (2D) Poisson problem on a Cartesian plane
The Poisson problem is a special case of the Helmholtz problem, when q=0.
The 2D Poisson problem is to find an approximate solution of the Poisson equation in a rectangle ax < x < bx, ay < y < by with the Dirichlet or Neumann boundary condition on each boundary bd_+. In case of a problem with the Neumann boundary condition on the entire boundary, you can find the solution of the problem only up to a constant. In this case, the Poisson Library will compute the solution that provides the minimal Euclidean norm of a residual.

Two-dimensional (2D) Laplace problem on a Cartesian plane
The Laplace problem is a special case of the Helmholtz problem, when q=0 and f(x, y)=0. The 2D Laplace problem is to find an approximate solution of the Laplace equation in a rectangle ax < x < bx, ay < y < by with the Dirichlet or Neumann boundary condition on each boundary bd_+.

Helmholtz problem on a sphere
The Helmholtz problem on a sphere is to find an approximate solution of the Helmholtz equation in a spherical rectangle, that is, a domain bounded by angles aφ ≤ φ ≤ bφ, aθ ≤ θ ≤ bθ, with boundary conditions for particular domains listed in Table "Details of Helmholtz Problem on a Sphere".

Details of Helmholtz Problem on a Sphere

Domain on a sphere: Rectangular, that is, bφ - aφ < 2π and bθ - aθ < π
Boundary condition: Homogeneous Dirichlet boundary conditions on each boundary bd_~
Periodic/non-periodic case: non-periodic

Domain on a sphere: Where aφ = 0, bφ = 2π, and bθ - aθ < π
Boundary condition: Homogeneous Dirichlet boundary conditions on the boundaries bd_aθ and bd_bθ
Periodic/non-periodic case: periodic

Domain on a sphere: Entire sphere, that is, aφ = 0, bφ = 2π, aθ = 0, and bθ = π
Boundary condition: Boundary condition at the poles.
Periodic/non-periodic case: periodic

Poisson problem on a sphere
The Poisson problem is a special case of the Helmholtz problem, when q=0. The Poisson problem on a sphere is to find an approximate solution of the Poisson equation in a spherical rectangle aφ ≤ φ ≤ bφ, aθ ≤ θ ≤ bθ in cases listed in Table "Details of Helmholtz Problem on a Sphere".
The solution to the Poisson problem on the entire sphere can be found up to a constant only. In this case, Poisson Library will compute the solution that provides the minimal Euclidean norm of a residual.

Approximation of 2D problems
To find an approximate solution for any of the 2D problems, a uniform mesh is built in the rectangular domain: in the Cartesian case and in the spherical case. Poisson Library uses the standard five-point finite difference approximation on this mesh to compute the approximation to the solution:
• In the Cartesian case, the values of the approximate solution will be computed in the mesh points (xi, yj) provided that the user knows the values of the right-hand side f(x, y) in these points and the values of the appropriate boundary functions G(x, y) and/or g(x, y) in the mesh points lying on the boundary of the rectangular domain.
• In the spherical case, the values of the approximate solution will be computed in the mesh points (φi, θj) provided that the user knows the values of the right-hand side f(φ, θ) in these points.

NOTE The number of mesh intervals nφ in the φ direction of a spherical mesh must be even in the periodic case. The current implementation of the Poisson Library does not support meshes with the number of intervals that does not meet this condition.
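The five-point approximation on a uniform Cartesian mesh can be illustrated with a short sketch. It is not the Poisson Library implementation. It assumes that the Helmholtz operator has the form -u_xx - u_yy + q*u, which the text above does not show, that the mesh nodes are x_i = ax + i*(bx - ax)/nx, and that nodal values are stored as u[i + j*(nx+1)]. The function names are hypothetical.

```c
/* Coordinate of mesh node i on [a, b] split into n equal intervals. */
double mesh_node(double a, double b, int n, int i) {
    return a + i * (b - a) / n;
}

/* Five-point approximation of -u_xx - u_yy + q*u at the interior node
   (i, j). The array u holds (nx+1)*(ny+1) nodal values, with the value at
   (x_i, y_j) stored in u[i + j*(nx+1)]; this layout is an assumption. */
double helmholtz_5pt(const double *u, int nx, double hx, double hy,
                     double q, int i, int j) {
    int stride = nx + 1;
    double c = u[i + j * stride];
    /* Second differences in x and in y around the centre value c. */
    double dxx = (u[(i - 1) + j * stride] - 2.0 * c + u[(i + 1) + j * stride])
                 / (hx * hx);
    double dyy = (u[i + (j - 1) * stride] - 2.0 * c + u[i + (j + 1) * stride])
                 / (hy * hy);
    return -dxx - dyy + q * c;
}
```

For a quadratic such as u = x*x + y*y the second differences are exact, so the stencil returns -4 when q=0, which makes a convenient sanity check for the indexing.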
Three-Dimensional Problems

Notational Conventions
The PL interface description uses the following notation for boundaries of a parallelepiped domain ax < x < bx, ay < y

<character><name>_<action>( )
where
• <character> indicates the data type:
  s real, single precision
  d real, double precision
• <name> indicates the task type:
  trnlsp nonlinear least squares problem without constraints
  trnlspbc nonlinear least squares problem with boundary constraints
  jacobi computation of the Jacobian matrix using central differences
• <action> indicates an action on the task:
  init initializes the solver
  check checks correctness of the input parameters
  solve solves the problem
  get retrieves the number of iterations, the stop criterion, the initial residual, and the final residual
  delete releases the allocated data

Nonlinear Least Squares Problem without Constraints
The nonlinear least squares problem without constraints can be described as follows:

$$\min_{x \in R^n} \|F(x)\|_2^2 = \min_{x \in R^n} \|y - f(x)\|_2^2$$

where F(x): R^n → R^m is a twice differentiable function in R^n.

Solving a nonlinear least squares problem means searching for the best approximation to the vector y with the model function f_i(x) and nonlinear variables x. The best approximation means that the sum of squares of residuals y_i - f_i(x) is the minimum.

See usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (ex_nlsqp_f.f and ex_nlsqp_c.c, respectively).

RCI TR Routines
Routine Name      Operation
?trnlsp_init      Initializes the solver.
?trnlsp_check     Checks correctness of the input parameters.
?trnlsp_solve     Solves a nonlinear least squares problem using the Trust-Region algorithm.
?trnlsp_get       Retrieves the number of iterations, stop criterion, initial residual, and final residual.
?trnlsp_delete    Releases allocated data.

?trnlsp_init
Initializes the solver of a nonlinear least squares problem.
Syntax
Fortran:
res = strnlsp_init(handle, n, m, x, eps, iter1, iter2, rs)
res = dtrnlsp_init(handle, n, m, x, eps, iter1, iter2, rs)
C:
res = strnlsp_init(&handle, &n, &m, x, eps, &iter1, &iter2, &rs);
res = dtrnlsp_init(&handle, &n, &m, x, eps, &iter1, &iter2, &rs);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description
The ?trnlsp_init routine initializes the solver. After initialization, all subsequent invocations of the ?trnlsp_solve routine should use the values of the handle returned by ?trnlsp_init. The eps array contains the stopping criteria:

eps Value    Description
1            Δ < eps(1)
2            ||F(x)||_2 < eps(2)
3            The Jacobian matrix is singular. ||J(x)(1:m,j)||_2 < eps(3), j = 1, ..., n
4            ||s||_2 < eps(4)
5            ||F(x)||_2 - ||F(x) - J(x)s||_2 < eps(5)
6            The trial step precision. If eps(6) = 0, then the trial step meets the required precision (≤ 1.0D-10).

Note:
• J(x) is the Jacobian matrix.
• Δ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters
n INTEGER. Length of x.
m INTEGER. Length of F(x).
x REAL for strnlsp_init, DOUBLE PRECISION for dtrnlsp_init. Array of size n. Initial guess.
eps REAL for strnlsp_init, DOUBLE PRECISION for dtrnlsp_init. Array of size 6; contains stopping criteria. See the values in the Description section.
iter1 INTEGER. Specifies the maximum number of iterations.
iter2 INTEGER. Specifies the maximum number of iterations of trial step calculation.
rs REAL for strnlsp_init, DOUBLE PRECISION for dtrnlsp_init. Definition of initial size of the trust region (boundary of the trial step). The minimum value is 0.1, and the maximum value is 100.0. Based on your knowledge of the objective function and initial guess, you can increase or decrease the initial trust region. It can influence the iteration process, for example, the direction of the iteration process and the number of iterations. The default value is 100.0.
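The input parameters of ?trnlsp_init can be grouped and range-checked on the caller's side before the solver is initialized. The sketch below is illustrative only: the tr_setup structure, the helper names, and the iteration limits are hypothetical and are not Intel MKL definitions; only the size of eps and the bounds and default of rs come from the description above. Note that the C array eps[0..5] corresponds to eps(1) through eps(6) in the table.

```c
/* Documented bounds and default for rs, the initial trust-region size. */
#define TR_RS_MIN     0.1
#define TR_RS_MAX     100.0
#define TR_RS_DEFAULT 100.0

/* Hypothetical container for the arguments passed to ?trnlsp_init. */
typedef struct {
    int n, m;          /* length of x and length of F(x) */
    int iter1, iter2;  /* iteration limits */
    double eps[6];     /* stopping criteria, eps[0] is eps(1) in the table */
    double rs;         /* initial trust-region size */
} tr_setup;

/* Fills a setup with one tolerance for all six criteria and the default rs. */
void tr_setup_defaults(tr_setup *s, int n, int m, double tol) {
    s->n = n;
    s->m = m;
    s->iter1 = 1000;   /* illustrative limit, not an MKL default */
    s->iter2 = 100;    /* illustrative limit, not an MKL default */
    for (int k = 0; k < 6; ++k) {
        s->eps[k] = tol;
    }
    s->rs = TR_RS_DEFAULT;
}

/* Returns 1 if rs lies within the documented range [0.1, 100.0]. */
int tr_rs_in_range(double rs) {
    return rs >= TR_RS_MIN && rs <= TR_RS_MAX;
}
```

Checking rs before the call gives an earlier and more specific diagnostic than waiting for the routine to report an error in the input parameters.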
Output Parameters

handle   Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
res      INTEGER. Indicates task completion status.
         • res = TR_SUCCESS - the routine completed the task normally.
         • res = TR_INVALID_OPTION - there was an error in the input parameters.
         • res = TR_OUT_OF_MEMORY - there was a memory error.
         TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

See Also
?trnlsp_solve

?trnlsp_check
Checks the correctness of handle and arrays containing Jacobian matrix, objective function, and stopping criteria.

Syntax

Fortran:
    res = strnlsp_check(handle, n, m, fjac, fvec, eps, info)
    res = dtrnlsp_check(handle, n, m, fjac, fvec, eps, info)
C:
    res = strnlsp_check(&handle, &n, &m, fjac, fvec, eps, info);
    res = dtrnlsp_check(&handle, &n, &m, fjac, fvec, eps, info);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlsp_check routine checks the arrays passed into the solver as input parameters. If an array contains any INF or NaN values, the routine sets the flag in the output array info (see the description of the values returned in the Output Parameters section for the info array).

Input Parameters

handle   Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
n        INTEGER. Length of x.
m        INTEGER. Length of F(x).
fjac     REAL for strnlsp_check; DOUBLE PRECISION for dtrnlsp_check.
         Array of size m by n. Contains the Jacobian matrix of the function.
fvec     REAL for strnlsp_check; DOUBLE PRECISION for dtrnlsp_check.
         Array of size m. Contains the function values at x, where fvec(i) = (y_i - f_i(x)).
eps      REAL for strnlsp_check; DOUBLE PRECISION for dtrnlsp_check.
         Array of size 6; contains stopping criteria. See the values in the Description section of ?trnlsp_init.

Output Parameters

info     INTEGER. Array of size 6.
         Results of input parameter checking:

C Language   Fortran Language   Used for           Value   Description
info(0)      info(1)            Flags for handle   0       The handle is valid.
                                                   1       The handle is not allocated.
info(1)      info(2)            Flags for fjac     0       The fjac array is valid.
                                                   1       The fjac array is not allocated.
                                                   2       The fjac array contains NaN.
                                                   3       The fjac array contains Inf.
info(2)      info(3)            Flags for fvec     0       The fvec array is valid.
                                                   1       The fvec array is not allocated.
                                                   2       The fvec array contains NaN.
                                                   3       The fvec array contains Inf.
info(3)      info(4)            Flags for eps      0       The eps array is valid.
                                                   1       The eps array is not allocated.
                                                   2       The eps array contains NaN.
                                                   3       The eps array contains Inf.
                                                   4       The eps array contains a value less than or equal to zero.

res      INTEGER. Information about completion of the task.
         res = TR_SUCCESS - the routine completed the task normally.
         TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlsp_solve
Solves a nonlinear least squares problem using the TR algorithm.

Syntax

Fortran:
    res = strnlsp_solve(handle, fvec, fjac, RCI_Request)
    res = dtrnlsp_solve(handle, fvec, fjac, RCI_Request)
C:
    res = strnlsp_solve(&handle, fvec, fjac, &RCI_Request);
    res = dtrnlsp_solve(&handle, fvec, fjac, &RCI_Request);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlsp_solve routine uses the TR algorithm to solve nonlinear least squares problems. The problem is stated as follows:

    min_{x ∈ R^n} ||F(x)||_2^2 = min_{x ∈ R^n} ||y - f(x)||_2^2,   y ∈ R^m, x ∈ R^n, f: R^n → R^m, m ≥ n

where
• F(x): R^n → R^m
• m ≥ n

From a current point x_current, the algorithm uses the trust-region approach to get x_new = x_current + s, where
• J(x) is the Jacobian matrix
• s is the trial step
• ||s||_2 ≤ Δ_current

The RCI_Request parameter provides additional information:

RCI_Request Value   Description
 2                  Request to calculate the Jacobian matrix and put the result into fjac
 1                  Request to recalculate the function at vector x and put the result into fvec
 0                  One successful iteration step on the current trust-region radius (that does not mean that the value of x has changed)
-1                  The algorithm has exceeded the maximum number of iterations
-2                  Δ < eps(1)
-3                  ||F(x)||_2 < eps(2)
-4                  The Jacobian matrix is singular. ||J(x)(1:m,j)||_2 < eps(3), j = 1, ..., n
-5                  ||s||_2 < eps(4)
-6                  ||F(x)||_2 - ||F(x) - J(x)s||_2 < eps(5)

Note:
• J(x) is the Jacobian matrix.
• Δ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

handle   Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
fvec     REAL for strnlsp_solve; DOUBLE PRECISION for dtrnlsp_solve.
         Array of size m. Contains the function values at x, where fvec(i) = (y_i - f_i(x)).
fjac     REAL for strnlsp_solve; DOUBLE PRECISION for dtrnlsp_solve.
         Array of size (m,n). Contains the Jacobian matrix of the function.

Output Parameters

fvec          REAL for strnlsp_solve; DOUBLE PRECISION for dtrnlsp_solve.
              Array of size m. Updated function evaluated at x.
RCI_Request   INTEGER. Informs about the task stage. See the Description section for the parameter values and their meaning.
res           INTEGER. Indicates the task completion.
              res = TR_SUCCESS - the routine completed the task normally.
              TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlsp_get
Retrieves the number of iterations, stop criterion, initial residual, and final residual.
Syntax

Fortran:
    res = strnlsp_get(handle, iter, st_cr, r1, r2)
    res = dtrnlsp_get(handle, iter, st_cr, r1, r2)
C:
    res = strnlsp_get(&handle, &iter, &st_cr, &r1, &r2);
    res = dtrnlsp_get(&handle, &iter, &st_cr, &r1, &r2);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The routine retrieves the current number of iterations, the stop criterion, the initial residual, and the final residual.

The initial residual is the value of the functional (||y - f(x)||) of the initial x values provided by the user.

The final residual is the value of the functional (||y - f(x)||) of the final x resulting from the algorithm operation.

The st_cr parameter contains the stop criterion:

st_cr Value   Description
1             The algorithm has exceeded the maximum number of iterations
2             Δ < eps(1)
3             ||F(x)||_2 < eps(2)
4             The Jacobian matrix is singular. ||J(x)(1:m,j)||_2 < eps(3), j = 1, ..., n
5             ||s||_2 < eps(4)
6             ||F(x)||_2 - ||F(x) - J(x)s||_2 < eps(5)

Note:
• J(x) is the Jacobian matrix.
• Δ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

handle   Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

iter     INTEGER. Contains the current number of iterations.
st_cr    INTEGER. Contains the stop criterion. See the Description section for the parameter values and their meanings.
r1       REAL for strnlsp_get; DOUBLE PRECISION for dtrnlsp_get.
         Contains the residual (||y - f(x)||) given the initial x.
r2       REAL for strnlsp_get; DOUBLE PRECISION for dtrnlsp_get.
         Contains the final residual, that is, the value of the functional (||y - f(x)||) of the final x resulting from the algorithm operation.
res      INTEGER. Indicates the task completion.
         res = TR_SUCCESS - the routine completed the task normally.
         TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlsp_delete
Releases allocated data.
Syntax

Fortran:
    res = strnlsp_delete(handle)
    res = dtrnlsp_delete(handle)
C:
    res = strnlsp_delete(&handle);
    res = dtrnlsp_delete(&handle);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlsp_delete routine releases all memory allocated for the handle.

This routine flags memory as not used, but to actually release all memory you must call the support function mkl_free_buffers.

Input Parameters

handle   Type _TRNSP_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

res      INTEGER. Indicates the task completion.
         res = TR_SUCCESS means the routine completed the task normally.
         TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

Nonlinear Least Squares Problem with Linear (Bound) Constraints

The nonlinear least squares problem with linear bound constraints is very similar to the nonlinear least squares problem without constraints, but it has the following constraints:

    l_i ≤ x_i ≤ u_i,   i = 1, ..., n

See usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (ex_nlsqp_bc_f.f and ex_nlsqp_bc_c.c, respectively).

RCI TR Routines for Problem with Bound Constraints

Routine Name        Operation
?trnlspbc_init      Initializes the solver.
?trnlspbc_check     Checks correctness of the input parameters.
?trnlspbc_solve     Solves a nonlinear least squares problem using RCI and the Trust-Region algorithm.
?trnlspbc_get       Retrieves the number of iterations, stop criterion, initial residual, and final residual.
?trnlspbc_delete    Releases allocated data.

?trnlspbc_init
Initializes the solver of a nonlinear least squares problem with linear (boundary) constraints.
Syntax

Fortran:
    res = strnlspbc_init(handle, n, m, x, LW, UP, eps, iter1, iter2, rs)
    res = dtrnlspbc_init(handle, n, m, x, LW, UP, eps, iter1, iter2, rs)
C:
    res = strnlspbc_init(&handle, &n, &m, x, LW, UP, eps, &iter1, &iter2, &rs);
    res = dtrnlspbc_init(&handle, &n, &m, x, LW, UP, eps, &iter1, &iter2, &rs);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlspbc_init routine initializes the solver. After initialization, all subsequent invocations of the ?trnlspbc_solve routine should use the value of the handle returned by ?trnlspbc_init.

The eps array contains the stopping criteria:

eps Value   Description
1           Δ < eps(1)
2           ||F(x)||_2 < eps(2)
3           The Jacobian matrix is singular. ||J(x)(1:m,j)||_2 < eps(3), j = 1, ..., n
4           ||s||_2 < eps(4)
5           ||F(x)||_2 - ||F(x) - J(x)s||_2 < eps(5)
6           The trial step precision. If eps(6) = 0, then the trial step meets the required precision (≤ 1.0D-10).

Note:
• J(x) is the Jacobian matrix.
• Δ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

n        INTEGER. Length of x.
m        INTEGER. Length of F(x).
x        REAL for strnlspbc_init; DOUBLE PRECISION for dtrnlspbc_init.
         Array of size n. Initial guess.
LW       REAL for strnlspbc_init; DOUBLE PRECISION for dtrnlspbc_init.
         Array of size n. Contains low bounds for x (lw_i < x_i).
UP       REAL for strnlspbc_init; DOUBLE PRECISION for dtrnlspbc_init.
         Array of size n. Contains upper bounds for x (up_i > x_i).
eps      REAL for strnlspbc_init; DOUBLE PRECISION for dtrnlspbc_init.
         Array of size 6; contains stopping criteria. See the values in the Description section.
iter1    INTEGER. Specifies the maximum number of iterations.
iter2    INTEGER. Specifies the maximum number of iterations of trial step calculation.
rs       REAL for strnlspbc_init; DOUBLE PRECISION for dtrnlspbc_init.
         Initial size of the trust region (boundary of the trial step). The minimum value is 0.1, and the maximum value is 100.0.
         Based on your knowledge of the objective function and initial guess, you can increase or decrease the initial trust region. This choice can influence the iteration process, for example, the direction of the iteration process and the number of iterations. The default value is 100.0.

Output Parameters

handle   Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
res      INTEGER. Informs about the task completion.
         • res = TR_SUCCESS - the routine completed the task normally.
         • res = TR_INVALID_OPTION - there was an error in the input parameters.
         • res = TR_OUT_OF_MEMORY - there was a memory error.
         TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

?trnlspbc_check
Checks the correctness of handle and arrays containing Jacobian matrix, objective function, lower and upper bounds, and stopping criteria.

Syntax

Fortran:
    res = strnlspbc_check(handle, n, m, fjac, fvec, LW, UP, eps, info)
    res = dtrnlspbc_check(handle, n, m, fjac, fvec, LW, UP, eps, info)
C:
    res = strnlspbc_check(&handle, &n, &m, fjac, fvec, LW, UP, eps, info);
    res = dtrnlspbc_check(&handle, &n, &m, fjac, fvec, LW, UP, eps, info);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlspbc_check routine checks the arrays passed into the solver as input parameters. If an array contains any INF or NaN values, the routine sets the flag in the output array info (see the description of the values returned in the Output Parameters section for the info array).

Input Parameters

handle   Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
n        INTEGER. Length of x.
m        INTEGER. Length of F(x).
fjac     REAL for strnlspbc_check; DOUBLE PRECISION for dtrnlspbc_check.
         Array of size m by n. Contains the Jacobian matrix of the function.
fvec     REAL for strnlspbc_check; DOUBLE PRECISION for dtrnlspbc_check.
         Array of size m. Contains the function values at x, where fvec(i) = (y_i - f_i(x)).
LW       REAL for strnlspbc_check; DOUBLE PRECISION for dtrnlspbc_check.
         Array of size n. Contains low bounds for x (lw_i < x_i).
UP       REAL for strnlspbc_check; DOUBLE PRECISION for dtrnlspbc_check.
         Array of size n. Contains upper bounds for x (up_i > x_i).
eps      REAL for strnlspbc_check; DOUBLE PRECISION for dtrnlspbc_check.
         Array of size 6; contains stopping criteria. See the values in the Description section of ?trnlspbc_init.

Output Parameters

info     INTEGER. Array of size 6.
         Results of input parameter checking:

C Language   Fortran Language   Used for           Value   Description
info(0)      info(1)            Flags for handle   0       The handle is valid.
                                                   1       The handle is not allocated.
info(1)      info(2)            Flags for fjac     0       The fjac array is valid.
                                                   1       The fjac array is not allocated.
                                                   2       The fjac array contains NaN.
                                                   3       The fjac array contains Inf.
info(2)      info(3)            Flags for fvec     0       The fvec array is valid.
                                                   1       The fvec array is not allocated.
                                                   2       The fvec array contains NaN.
                                                   3       The fvec array contains Inf.
info(3)      info(4)            Flags for LW       0       The LW array is valid.
                                                   1       The LW array is not allocated.
                                                   2       The LW array contains NaN.
                                                   3       The LW array contains Inf.
                                                   4       The lower bound is greater than the upper bound.
info(4)      info(5)            Flags for UP       0       The UP array is valid.
                                                   1       The UP array is not allocated.
                                                   2       The UP array contains NaN.
                                                   3       The UP array contains Inf.
                                                   4       The upper bound is less than the lower bound.
info(5)      info(6)            Flags for eps      0       The eps array is valid.
                                                   1       The eps array is not allocated.
                                                   2       The eps array contains NaN.
                                                   3       The eps array contains Inf.
                                                   4       The eps array contains a value less than or equal to zero.

res      INTEGER. Information about completion of the task.
         res = TR_SUCCESS - the routine completed the task normally.
         TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlspbc_solve
Solves a nonlinear least squares problem with linear (bound) constraints using the Trust-Region algorithm.
Syntax

Fortran:
    res = strnlspbc_solve(handle, fvec, fjac, RCI_Request)
    res = dtrnlspbc_solve(handle, fvec, fjac, RCI_Request)
C:
    res = strnlspbc_solve(&handle, fvec, fjac, &RCI_Request);
    res = dtrnlspbc_solve(&handle, fvec, fjac, &RCI_Request);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlspbc_solve routine, based on RCI, uses the Trust-Region algorithm to solve nonlinear least squares problems with linear (bound) constraints. The problem is stated as follows:

    min_{x ∈ R^n} ||F(x)||_2^2 = min_{x ∈ R^n} ||y - f(x)||_2^2,   y ∈ R^m, x ∈ R^n, f: R^n → R^m, m ≥ n

where

    l_i ≤ x_i ≤ u_i,   i = 1, ..., n.

The RCI_Request parameter provides additional information:

RCI_Request Value   Description
 2                  Request to calculate the Jacobian matrix and put the result into fjac
 1                  Request to recalculate the function at vector x and put the result into fvec
 0                  One successful iteration step on the current trust-region radius (that does not mean that the value of x has changed)
-1                  The algorithm has exceeded the maximum number of iterations
-2                  Δ < eps(1)
-3                  ||F(x)||_2 < eps(2)
-4                  The Jacobian matrix is singular. ||J(x)(1:m,j)||_2 < eps(3), j = 1, ..., n
-5                  ||s||_2 < eps(4)
-6                  ||F(x)||_2 - ||F(x) - J(x)s||_2 < eps(5)

Note:
• J(x) is the Jacobian matrix.
• Δ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

handle   Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.
fvec     REAL for strnlspbc_solve; DOUBLE PRECISION for dtrnlspbc_solve.
         Array of size m. Contains the function values at x, where fvec(i) = (y_i - f_i(x)).
fjac     REAL for strnlspbc_solve; DOUBLE PRECISION for dtrnlspbc_solve.
         Array of size m by n. Contains the Jacobian matrix of the function.

Output Parameters

fvec          REAL for strnlspbc_solve; DOUBLE PRECISION for dtrnlspbc_solve.
              Array of size m. Updated function evaluated at x.
RCI_Request   INTEGER. Informs about the task stage. See the Description section for the parameter values and their meaning.
res           INTEGER.
              Informs about the task completion.
              res = TR_SUCCESS means the routine completed the task normally.
              TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlspbc_get
Retrieves the number of iterations, stop criterion, initial residual, and final residual.

Syntax

Fortran:
    res = strnlspbc_get(handle, iter, st_cr, r1, r2)
    res = dtrnlspbc_get(handle, iter, st_cr, r1, r2)
C:
    res = strnlspbc_get(&handle, &iter, &st_cr, &r1, &r2);
    res = dtrnlspbc_get(&handle, &iter, &st_cr, &r1, &r2);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The routine retrieves the current number of iterations, the stop criterion, the initial residual, and the final residual.

The st_cr parameter contains the stop criterion:

st_cr Value   Description
1             The algorithm has exceeded the maximum number of iterations
2             Δ < eps(1)
3             ||F(x)||_2 < eps(2)
4             The Jacobian matrix is singular. ||J(x)(1:m,j)||_2 < eps(3), j = 1, ..., n
5             ||s||_2 < eps(4)
6             ||F(x)||_2 - ||F(x) - J(x)s||_2 < eps(5)

Note:
• J(x) is the Jacobian matrix.
• Δ is the trust-region area.
• F(x) is the value of the functional.
• s is the trial step.

Input Parameters

handle   Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

iter     INTEGER. Contains the current number of iterations.
st_cr    INTEGER. Contains the stop criterion. See the Description section for the parameter values and their meanings.
r1       REAL for strnlspbc_get; DOUBLE PRECISION for dtrnlspbc_get.
         Contains the residual (||y - f(x)||) given the initial x.
r2       REAL for strnlspbc_get; DOUBLE PRECISION for dtrnlspbc_get.
         Contains the final residual, that is, the value of the function (||y - f(x)||) of the final x resulting from the algorithm operation.
res      INTEGER. Informs about the task completion.
         res = TR_SUCCESS - the routine completed the task normally.
         TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?trnlspbc_delete
Releases allocated data.
Syntax

Fortran:
    res = strnlspbc_delete(handle)
    res = dtrnlspbc_delete(handle)
C:
    res = strnlspbc_delete(&handle);
    res = dtrnlspbc_delete(&handle);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?trnlspbc_delete routine releases all memory allocated for the handle.

NOTE This routine flags memory as not used, but to actually release all memory you must call the support function mkl_free_buffers.

Input Parameters

handle   Type _TRNSPBC_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

res      INTEGER. Informs about the task completion.
         res = TR_SUCCESS means the routine completed the task normally.
         TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

Jacobian Matrix Calculation Routines

This section describes routines that compute the Jacobian matrix using the central difference algorithm. Jacobian matrix calculation is required to solve a nonlinear least squares problem and systems of nonlinear equations (with or without linear bound constraints). Routines for calculation of the Jacobian matrix have "Black-Box" interfaces, where you pass the objective function via parameters. Your objective function must have a fixed interface.

Jacobian Matrix Calculation Routines

Routine Name      Operation
?jacobi_init      Initializes the solver.
?jacobi_solve     Computes the Jacobian matrix of the function on the basis of RCI using the central difference algorithm.
?jacobi_delete    Removes data.
?jacobi           Computes the Jacobian matrix of the fcn function using the central difference algorithm.
?jacobix          Presents an alternative interface for the ?jacobi function enabling you to pass additional data into the objective function.

?jacobi_init
Initializes the solver for Jacobian calculations.
Syntax

Fortran:
    res = sjacobi_init(handle, n, m, x, fjac, eps)
    res = djacobi_init(handle, n, m, x, fjac, eps)
C:
    res = sjacobi_init(&handle, &n, &m, x, fjac, &eps);
    res = djacobi_init(&handle, &n, &m, x, fjac, &eps);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The routine initializes the solver.

Input Parameters

n        INTEGER. Length of x.
m        INTEGER. Length of F.
x        REAL for sjacobi_init; DOUBLE PRECISION for djacobi_init.
         Array of size n. Vector at which the function is evaluated.
eps      REAL for sjacobi_init; DOUBLE PRECISION for djacobi_init.
         Precision of the Jacobian matrix calculation.
fjac     REAL for sjacobi_init; DOUBLE PRECISION for djacobi_init.
         Array of size (m,n). Contains the Jacobian matrix of the function.

Output Parameters

handle   Data object of the _JACOBIMATRIX_HANDLE_t type in C/C++ and INTEGER*8 in FORTRAN.
res      INTEGER. Indicates task completion status.
         • res = TR_SUCCESS - the routine completed the task normally.
         • res = TR_INVALID_OPTION - there was an error in the input parameters.
         • res = TR_OUT_OF_MEMORY - there was a memory error.
         TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

?jacobi_solve
Computes the Jacobian matrix of the function using RCI and the central difference algorithm.

Syntax

Fortran:
    res = sjacobi_solve(handle, f1, f2, RCI_Request)
    res = djacobi_solve(handle, f1, f2, RCI_Request)
C:
    res = sjacobi_solve(&handle, f1, f2, &RCI_Request);
    res = djacobi_solve(&handle, f1, f2, &RCI_Request);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?jacobi_solve routine computes the Jacobian matrix of the function using RCI and the central difference algorithm.

See usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (sjacobi_rci_f.f, djacobi_rci_f.f and sjacobi_rci_c.c, djacobi_rci_c.c, respectively).
Input Parameters

handle   Type _JACOBIMATRIX_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

f1            REAL for sjacobi_solve; DOUBLE PRECISION for djacobi_solve.
              Array of size m. Contains the updated function values at x + eps.
f2            REAL for sjacobi_solve; DOUBLE PRECISION for djacobi_solve.
              Array of size m. Contains the updated function values at x - eps.
RCI_Request   INTEGER. Informs about the task completion.
              • RCI_Request = 0 indicates that the task has completed successfully.
              • RCI_Request = 1 indicates that you should compute the function values at the current x point and put the results into f1.
              • RCI_Request = 2 indicates that you should compute the function values at the current x point and put the results into f2.
res           INTEGER. Indicates the task completion status.
              • res = TR_SUCCESS - the routine completed the task normally.
              • res = TR_INVALID_OPTION - there was an error in the input parameters.
              TR_SUCCESS and TR_INVALID_OPTION are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

See Also
?jacobi_init

?jacobi_delete
Releases allocated data.

Syntax

Fortran:
    res = sjacobi_delete(handle)
    res = djacobi_delete(handle)
C:
    res = sjacobi_delete(&handle);
    res = djacobi_delete(&handle);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?jacobi_delete routine releases all memory allocated for the handle.

This routine flags memory as not used, but to actually release all memory you must call the support function mkl_free_buffers.

Input Parameters

handle   Type _JACOBIMATRIX_HANDLE_t in C/C++ and INTEGER*8 in FORTRAN.

Output Parameters

res      INTEGER. Informs about the task completion.
         res = TR_SUCCESS means the routine completed the task normally.
         TR_SUCCESS is defined in the mkl_rci.h and mkl_rci.fi include files.

?jacobi
Computes the Jacobian matrix of the objective function using the central difference algorithm.
Syntax

Fortran:
    res = sjacobi(fcn, n, m, fjac, x, jac_eps)
    res = djacobi(fcn, n, m, fjac, x, jac_eps)
C:
    res = sjacobi(fcn, &n, &m, fjac, x, &jac_eps);
    res = djacobi(fcn, &n, &m, fjac, x, &jac_eps);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?jacobi routine computes the Jacobian matrix for function fcn using the central difference algorithm. This routine has a "Black-Box" interface, where you input the objective function via parameters. Your objective function must have a fixed interface.

See calling and usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (ex_nlsqp_f.f, ex_nlsqp_bc_f.f and ex_nlsqp_c.c, ex_nlsqp_bc_c.c, respectively).

Input Parameters

fcn      User-supplied subroutine to evaluate the function that defines the least squares problem. Call fcn (m, n, x, f) with the following parameters:

         Parameter   Type                          Description
         Input Parameters
         m           INTEGER                       Length of f
         n           INTEGER                       Length of x
         x           REAL for sjacobi;             Array of size n. Vector at which the function is
                     DOUBLE PRECISION for djacobi  evaluated. The fcn function should not change this parameter.
         Output Parameters
         f           REAL for sjacobi;             Array of size m; contains the function values at x.
                     DOUBLE PRECISION for djacobi

         You need to declare fcn as EXTERNAL in the calling program.
n        INTEGER. Length of x.
m        INTEGER. Length of F.
x        REAL for sjacobi; DOUBLE PRECISION for djacobi.
         Array of size n. Vector at which the function is evaluated.
jac_eps  REAL for sjacobi; DOUBLE PRECISION for djacobi.
         Precision of the Jacobian matrix calculation.

Output Parameters

fjac     REAL for sjacobi; DOUBLE PRECISION for djacobi.
         Array of size (m,n). Contains the Jacobian matrix of the function.
res      INTEGER. Indicates task completion status.
         • res = TR_SUCCESS - the routine completed the task normally.
         • res = TR_INVALID_OPTION - there was an error in the input parameters.
         • res = TR_OUT_OF_MEMORY - there was a memory error.
         TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

See Also
?jacobix

?jacobix
Alternative interface for the ?jacobi function for passing additional data into the objective function.

Syntax

Fortran:
    res = sjacobix(fcn, n, m, fjac, x, jac_eps, user_data)
    res = djacobix(fcn, n, m, fjac, x, jac_eps, user_data)
C:
    res = sjacobix(fcn, &n, &m, fjac, x, &jac_eps, user_data);
    res = djacobix(fcn, &n, &m, fjac, x, &jac_eps, user_data);

Include Files
• Fortran: mkl_rci.fi
• C: mkl_rci.h

Description

The ?jacobix routine presents an alternative interface for the ?jacobi function that enables you to pass additional data into the objective function fcn.

See calling and usage examples in FORTRAN and C in the examples\solver\source folder of your Intel MKL directory (ex_nlsqp_f_x.f, ex_nlsqp_bc_f_x.f and ex_nlsqp_c_x.c, ex_nlsqp_bc_c_x.c, respectively).

Input Parameters

fcn      User-supplied subroutine to evaluate the function that defines the least squares problem. Call fcn (m, n, x, f, user_data) with the following parameters:

         Parameter   Type                           Description
         Input Parameters
         m           INTEGER                        Length of f
         n           INTEGER                        Length of x
         x           REAL for sjacobix;             Array of size n. Vector at which the function is
                     DOUBLE PRECISION for djacobix  evaluated. The fcn function should not change this parameter.
         user_data   INTEGER*8 for Fortran;         (Fortran) Your additional data, if any. Otherwise, a dummy argument.
                     void* for C                    (C) Pointer to your additional data, if any. Otherwise, a dummy argument.
         Output Parameters
         f           REAL for sjacobix;             Array of size m; contains the function values at x.
                     DOUBLE PRECISION for djacobix

         You need to declare fcn as EXTERNAL in the calling program.
n        INTEGER. Length of x.
m        INTEGER. Length of F.
x        REAL for sjacobix; DOUBLE PRECISION for djacobix.
         Array of size n. Vector at which the function is evaluated.
jac_eps  REAL for sjacobix; DOUBLE PRECISION for djacobix.
         Precision of the Jacobian matrix calculation.
user_data
         (Fortran) INTEGER*8. Contains your additional data. If there is no additional data, this is a dummy argument.
         (C) void*. Pointer to your additional data. If there is no additional data, this is a dummy argument.

Output Parameters

fjac     REAL for sjacobix; DOUBLE PRECISION for djacobix.
         Array of size (m,n). Contains the Jacobian matrix of the function.
res      INTEGER. Indicates task completion status.
         • res = TR_SUCCESS - the routine completed the task normally.
         • res = TR_INVALID_OPTION - there was an error in the input parameters.
         • res = TR_OUT_OF_MEMORY - there was a memory error.
         TR_SUCCESS, TR_INVALID_OPTION, and TR_OUT_OF_MEMORY are defined in mkl_rci.fi (Fortran) and mkl_rci.h (C) include files.

See Also
?jacobi

Support Functions

Intel® Math Kernel Library (Intel® MKL) support functions are used to:
• retrieve information about the current Intel MKL version
• additionally control the number of threads
• handle errors
• test characters and character strings for equality
• measure user time for a process and elapsed CPU time
• measure CPU frequency
• free memory allocated by Intel MKL memory management software
• facilitate easy linking

Functions described below are subdivided according to their purpose into the following groups:
• Version Information Functions
• Threading Control Functions
• Error Handling Functions
• Equality Test Functions
• Timing Functions
• Memory Functions
• Miscellaneous Utility Functions
• Functions Supporting the Single Dynamic Library

Table "Intel MKL Support Functions" contains the list of support functions common for Intel MKL.

Intel MKL Support Functions

Function Name                 Operation
Version Information Functions
mkl_get_version               Returns information about the active library version.
mkl_get_version_string        Returns information about the library version string.
Threading Control Functions
mkl_set_num_threads           Suggests the number of threads to use.
mkl_domain_set_num_threads    Suggests the number of threads for a particular function domain.
mkl_set_dynamic               Enables Intel MKL to dynamically change the number of threads.
mkl_get_max_threads           Inquires about the number of threads targeted for parallelism.
mkl_domain_get_max_threads    Inquires about the number of threads targeted for parallelism in different domains.
mkl_get_dynamic               Returns the current value of the MKL_DYNAMIC variable.
Error Handling Functions
xerbla                        Handles error conditions for the BLAS, LAPACK, VSL, VML routines.
pxerbla                       Handles error conditions for the ScaLAPACK routines.
Equality Test Functions
lsame                         Tests two characters for equality regardless of the case.
lsamen                        Tests two character strings for equality regardless of the case.
Timing Functions
second/dsecnd                 Returns user time for a process.
mkl_get_cpu_clocks            Returns full precision elapsed CPU clocks.
mkl_get_cpu_frequency         Returns CPU frequency value in GHz.
mkl_get_max_cpu_frequency     Returns the maximum CPU frequency value in GHz.
mkl_get_clocks_frequency      Returns the frequency value in GHz based on the constant-rate Time Stamp Counter.
Memory Functions
mkl_free_buffers              Frees memory buffers.
mkl_thread_free_buffers       Frees memory buffers allocated only in the current thread.
mkl_mem_stat                  Reports the amount of memory utilized by Intel MKL memory management software.
mkl_disable_fast_mm           Enables Intel MKL to dynamically turn off memory management.
mkl_malloc                    Allocates the aligned memory buffer.
mkl_free                      Frees the aligned memory buffer allocated by mkl_malloc.
Miscellaneous Utility Functions
mkl_progress                  Tracks computational progress of selective MKL routines.
mkl_enable_instructions       Allows Intel MKL to dispatch Intel® Advanced Vector Extensions (Intel® AVX) if run on the respective hardware (or simulation).
Functions Supporting the Single Dynamic Library (SDL) mkl_set_interface_layer Sets the interface layer for Intel MKL at run time. mkl_set_threading_layer Sets the threading layer for Intel MKL at run time. mkl_set_xerbla Replaces the error handling routine. Use with SDL on Windows* OS. mkl_set_progress Replaces the progress information routine. Use with SDL on Windows* OS. 15 Intel® Math Kernel Library Reference Manual 2520 Version Information Functions Intel® MKL provides two methods for extracting information about the library version number: • extracting a version string using the mkl_get_version_string function • using the mkl_get_version function to obtain an MKLVersion structure that contains the version information A makefile is also provided to automatically build the examples and output summary files containing the version information for the current library. mkl_get_version Returns information about the active library C version. Syntax void mkl_get_version( MKLVersion* pVersion ); Include Files • C: mkl_service.h Output Parameters pVersion Pointer to the MKLVersion structure. Description The mkl_get_version function collects information about the active C version of the Intel MKL software and returns this information in a structure of MKLVersion type by the pVersion address. The MKLVersion structure type is defined in the mkl_types.h file. The following fields of the MKLVersion structure are available: MajorVersion is the major number of the current library version. MinorVersion is the minor number of the current library version. UpdateVersion is the update number of the current library version. ProductStatus is the status of the current library version. Possible variants could be “Beta”, “Product”. Build is the string that contains the build date and the internal build number. Processor is the processor optimization that is targeted for the specific processor. 
It does not identify the processor installed in the system; rather, it names the Intel MKL optimization detected as optimal for that processor.

NOTE MKLGetVersion is an obsolete name for the mkl_get_version function. It is still referenced in the library for backward compatibility but is deprecated and subject to removal in subsequent releases.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

mkl_get_version Usage
----------------------------------------------------------------------------------------------
#include <stdio.h>
#include "mkl_service.h"

int main(void)
{
    MKLVersion Version;

    mkl_get_version(&Version);
    // MKL_Get_Version(&Version);

    printf("Major version: %d\n",Version.MajorVersion);
    printf("Minor version: %d\n",Version.MinorVersion);
    printf("Update version: %d\n",Version.UpdateVersion);
    printf("Product status: %s\n",Version.ProductStatus);
    printf("Build: %s\n",Version.Build);
    printf("Processor optimization: %s\n",Version.Processor);
    printf("================================================================\n");
    printf("\n");
    return 0;
}
----------------------------------------------------------------------------------------------
Output:
Major Version             9
Minor Version             0
Update Version            0
Product status            Product
Build                     061909.09
Processor optimization    Intel® Xeon® Processor with Intel® 64 architecture

mkl_get_version_string
Gets the library version string.

Syntax
Fortran: call mkl_get_version_string( buf )
C: mkl_get_version_string( buf, len );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters

Name    Type                        Description
buf     FORTRAN: CHARACTER*198      Source string
        C: char*
len     FORTRAN: INTEGER            Length of the source string
        C: int

Description
The function returns a string that contains the library version information.

NOTE MKLGetVersionString is an obsolete name for the mkl_get_version_string function. It is still referenced in the library for backward compatibility but is deprecated and subject to removal in subsequent releases.
See the examples below.

Examples

Fortran Example
      program mkl_get_version_string
      character*198 buf
      call mkl_get_version_string(buf)
      write(*,'(a)') buf
      end

C Example
#include <stdio.h>
#include "mkl_service.h"

int main(void)
{
    int len=198;
    char buf[198];

    mkl_get_version_string(buf, len);
    printf("%s\n",buf);
    printf("\n");
    return 0;
}

Threading Control Functions

Intel® MKL provides optional threading control functions that take precedence over OpenMP* environment variable settings with the same purpose (see Intel® MKL User's Guide for details). These functions enable you to specify the number of threads for Intel MKL independently of the OpenMP* settings. Although Intel MKL may actually use a different number of threads from the number suggested, the controls also enable you to instruct the library to try using the suggested number when the number used in the calling application is unavailable.

See the following examples of Fortran and C usage:

Fortran Usage
      call mkl_set_num_threads( num )
      ierr = mkl_domain_set_num_threads( num, MKL_DOMAIN_BLAS )
      call mkl_set_dynamic ( 1 )
      num = mkl_get_max_threads()
      num = mkl_domain_get_max_threads( MKL_DOMAIN_BLAS )
      ret = mkl_get_dynamic()

C Usage
#include "mkl.h" // Mandatory to make these definitions work!

mkl_set_num_threads(num);
return_code = mkl_domain_set_num_threads( num, MKL_DOMAIN_FFT );
mkl_set_dynamic( 1 );
num = mkl_get_max_threads();
num = mkl_domain_get_max_threads( MKL_DOMAIN_FFT );
return_code = mkl_get_dynamic();

NOTE Always remember to add #include "mkl.h" to use the C usage syntax.

mkl_set_num_threads
Suggests the number of threads to use.

Syntax
Fortran: call mkl_set_num_threads( number )
C: void mkl_set_num_threads( number );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

Name      Type                 Description
number    FORTRAN: INTEGER     Number of threads suggested by user
          C: int

Description
This function allows you to specify how many threads Intel MKL should use. The number is a hint, and there is no guarantee that exactly this number of threads will be used. Enter a positive integer. This routine takes precedence over the MKL_NUM_THREADS environment variable.

NOTE Always remember to add #include "mkl.h" to use the C usage syntax.

See Intel MKL User's Guide for implementation details.

mkl_domain_set_num_threads
Suggests the number of threads for a particular function domain.

Syntax
Fortran: ierr = mkl_domain_set_num_threads( num, mask )
C: ierr = mkl_domain_set_num_threads( num, mask );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

Name    Type                 Description
num     FORTRAN: INTEGER     Number of threads suggested by user
        C: int
mask    FORTRAN: INTEGER     Name of the targeted domain
        C: int

Description
This function allows you to request different domains of Intel MKL to use different numbers of threads. The currently supported domains are:
• MKL_DOMAIN_BLAS - BLAS
• MKL_DOMAIN_FFT - FFT (excluding cluster FFT)
• MKL_DOMAIN_VML - Vector Math Library
• MKL_DOMAIN_PARDISO - PARDISO
• MKL_DOMAIN_ALL - another way to do what mkl_set_num_threads does

This is only a hint, and use of this number of threads is not guaranteed. Enter a valid domain and a positive integer for the number of threads. This routine takes precedence over the MKL_DOMAIN_NUM_THREADS environment variable.

See Intel MKL User's Guide for implementation details.

Return Values
1 (true)     Indicates no error, execution is successful.
0 (false)    Indicates failure, possibly because the inputs were invalid.
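The input and return-value contract described above can be illustrated with a small stand-alone model. This is only a sketch and is not how Intel MKL is implemented: the demo_ names and the DEMO_DOMAIN_ values are hypothetical placeholders, and the fallback to the "all domains" hint when a domain has no hint of its own is an assumption made for illustration.

```c
/* Toy model of a per-domain thread hint table (hypothetical, not MKL code). */
enum demo_domain {
    DEMO_DOMAIN_ALL,      /* placeholder for MKL_DOMAIN_ALL     */
    DEMO_DOMAIN_BLAS,     /* placeholder for MKL_DOMAIN_BLAS    */
    DEMO_DOMAIN_FFT,      /* placeholder for MKL_DOMAIN_FFT     */
    DEMO_DOMAIN_VML,      /* placeholder for MKL_DOMAIN_VML     */
    DEMO_DOMAIN_PARDISO,  /* placeholder for MKL_DOMAIN_PARDISO */
    DEMO_DOMAIN_COUNT
};

/* 0 means "no hint stored for this domain". */
static int demo_hint[DEMO_DOMAIN_COUNT];

/* Returns 1 on success and 0 if the domain is invalid or num is not positive,
   mirroring the documented return values. */
int demo_domain_set_num_threads(int num, int mask)
{
    if (mask < 0 || mask >= DEMO_DOMAIN_COUNT || num < 1)
        return 0;
    demo_hint[mask] = num;
    return 1;
}

/* Returns the hint for a domain; falls back to the ALL hint when the domain
   has none (assumed behavior). Returns 0 for an invalid domain. */
int demo_domain_get_max_threads(int mask)
{
    if (mask < 0 || mask >= DEMO_DOMAIN_COUNT)
        return 0;
    return demo_hint[mask] ? demo_hint[mask] : demo_hint[DEMO_DOMAIN_ALL];
}
```

In real code the same pattern applies to the library call itself: check that ierr equals 1 after mkl_domain_set_num_threads before relying on the hint.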
mkl_set_dynamic
Enables Intel MKL to dynamically change the number of threads.

Syntax
Fortran: call mkl_set_dynamic( boolean_var )
C: void mkl_set_dynamic( boolean_var );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

Name           Type                 Description
boolean_var    FORTRAN: INTEGER     The parameter that determines whether dynamic adjustment of the number of threads is enabled or disabled.
               C: int

Description
This function indicates whether or not Intel MKL can dynamically change the number of threads. The default is true, regardless of how the OMP_DYNAMIC variable is set. This setting also takes precedence over the OMP_DYNAMIC variable.

A value of false does not guarantee that the number of threads you requested will be used, but it means that Intel MKL will attempt to use that value. This routine takes precedence over the MKL_DYNAMIC environment variable.

Note that if Intel MKL is called from within a parallel region, Intel MKL may not thread unless MKL_DYNAMIC is set to false, either with the environment variable or by this routine call.

See Intel MKL User's Guide for implementation details.

mkl_get_max_threads
Inquires about the number of threads targeted for parallelism.

Syntax
Fortran: num = mkl_get_max_threads()
C: num = mkl_get_max_threads();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
This function allows you to inquire, independently of OpenMP*, how many threads Intel MKL is targeting for parallelism. The number is a hint, and there is no guarantee that exactly this number of threads will be used.

See Intel MKL User's Guide for implementation details.

Return Values
The output is an INTEGER equal to the number of threads.

mkl_domain_get_max_threads
Inquires about the number of threads targeted for parallelism in different domains.

Syntax
Fortran: ierr = mkl_domain_get_max_threads( mask )
C: ierr = mkl_domain_get_max_threads( mask );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

Name    Type                 Description
mask    FORTRAN: INTEGER     The name of the targeted domain
        C: int

Description
This function allows you to inquire what number of threads is being used as a hint for a given domain of Intel MKL. The inquiry does not imply that this is the actual number of threads used. The number may vary depending on the value of the MKL_DYNAMIC variable, problem size, system resources, and so on. The function returns the value that Intel MKL is targeting for the given domain. The currently supported domains are:
• MKL_DOMAIN_BLAS - BLAS
• MKL_DOMAIN_FFT - FFT (excluding cluster FFT)
• MKL_DOMAIN_VML - Vector Math Library
• MKL_DOMAIN_PARDISO - PARDISO
• MKL_DOMAIN_ALL - another way to do what mkl_get_max_threads does

Enter a valid domain.

See Intel MKL User's Guide for implementation details.

Return Values
Returns the hint about the number of threads for a given domain.

mkl_get_dynamic
Returns the current value of the MKL_DYNAMIC variable.

Syntax
Fortran: ret = mkl_get_dynamic()
C: ret = mkl_get_dynamic();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
This function returns the current value of the MKL_DYNAMIC variable. This variable can be changed by setting the MKL_DYNAMIC environment variable before the Intel MKL run is launched or by calling mkl_set_dynamic(). The function call takes precedence over the environment variable.

The function returns a value of 0 or 1: 1 indicates that MKL_DYNAMIC is true, 0 indicates that MKL_DYNAMIC is false. This variable indicates whether or not Intel MKL can dynamically change the number of threads. A value of false does not guarantee that the number of threads you requested will be used, but it means that Intel MKL will attempt to use that value.

Note that if Intel MKL is called from within a parallel region, Intel MKL may not thread unless MKL_DYNAMIC is set to false, either with the environment variable or by this routine call.

See Intel MKL User's Guide for implementation details.

Return Values
1    Indicates MKL_DYNAMIC is true.
0    Indicates MKL_DYNAMIC is false.

Error Handling Functions

xerbla
Error handling routine called by BLAS, LAPACK, VML, VSL routines.

Syntax
Fortran: call xerbla( srname, info )
C: xerbla( srname, info, len );

Include Files
• FORTRAN 77: mkl_blas.fi
• C: mkl_blas.h

Input Parameters

Name      Type                       Description
srname    FORTRAN: CHARACTER*(*)     The name of the routine that called xerbla
          C: char*
info      FORTRAN: INTEGER           The position of the invalid parameter in the parameter list of the calling routine
          C: int*
len       C: int                     Length of the source string

Description
The routine xerbla is an error handler for the BLAS, LAPACK, VSL, and VML routines. It is called by a BLAS, LAPACK, VSL or VML routine if an input parameter has an invalid value. If an issue is found with an input parameter, xerbla prints a message similar to the following:

MKL ERROR: Parameter 6 was incorrect on entry to DGEMM

and then returns to your application. Comments in the LAPACK reference code (http://www.netlib.org/lapack/explore-html/xerbla.f.html) suggest this behavior, although the LAPACK User's Guide recommends that execution should stop when an error is found.

Note that xerbla is an internal function. You can change or disable printing of an error message by providing your own xerbla function. See the FORTRAN and C examples below.
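As one possible variant of a replacement handler, the sketch below records the most recent error instead of printing it, so the application can inspect it later. The function signature follows the C syntax given above; the global variables last_xerbla_name and last_xerbla_info and the constant LAST_NAME_MAX are hypothetical names introduced only for this illustration. The sketch assumes the name may arrive blank-padded and without a terminating null character, so it copies at most len bytes.

```c
#include <string.h>

#define LAST_NAME_MAX 32   /* hypothetical buffer size for the routine name */

char last_xerbla_name[LAST_NAME_MAX];  /* name of the routine that reported the error */
int  last_xerbla_info = 0;             /* position of the invalid parameter, 0 if none */

void xerbla(char* srname, int* info, int len)
{
    int n = len;

    /* Clamp the length so the copy always fits in the buffer. */
    if (n < 0) n = 0;
    if (n > LAST_NAME_MAX - 1) n = LAST_NAME_MAX - 1;

    /* Copy exactly n bytes and terminate the string ourselves. */
    memcpy(last_xerbla_name, srname, (size_t)n);
    last_xerbla_name[n] = '\0';

    /* Drop trailing blanks that a Fortran caller may have padded with. */
    while (n > 0 && last_xerbla_name[n - 1] == ' ')
        last_xerbla_name[--n] = '\0';

    last_xerbla_info = *info;
}
```

After a call to a BLAS or LAPACK routine, the application could test last_xerbla_info for a nonzero value and then reset it to 0.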
Examples

      subroutine xerbla (srname, info)
      character*(*) srname  !Name of subprogram that called xerbla
      integer*4 info        !Position of the invalid parameter in the parameter list
      return                !Return to the calling subprogram
      end

#include <stdio.h>

void xerbla(char* srname, int* info, int len){
    // srname - name of the function that called xerbla
    // info - position of the invalid parameter in the parameter list
    // len - length of the name in bytes
    // Print at most len characters, in case srname has no terminating NULL
    printf("\nXERBLA is called :%.*s: %d\n",len,srname,*info);
}

pxerbla
Error handling routine called by ScaLAPACK routines.

Syntax
call pxerbla(ictxt, srname, info)

Include Files
• C: mkl_scalapack.h

Input Parameters
ictxt     (global) INTEGER
          The BLACS context handle, indicating the global context of the operation. The context itself is global.
srname    (global) CHARACTER*6
          The name of the routine which called pxerbla.
info      (global) INTEGER.
          The position of the invalid parameter in the parameter list of the calling routine.

Description
This routine is an error handler for the ScaLAPACK routines. It is called if an input parameter has an invalid value. A message is printed and program execution continues. For ScaLAPACK driver and computational routines, a RETURN statement is issued following the call to pxerbla. Control returns to the higher-level calling routine, and you can determine how the program should proceed. However, in the specialized low-level ScaLAPACK routines (auxiliary routines that are Level 2 equivalents of computational routines), the call to pxerbla() is immediately followed by a call to BLACS_ABORT() to terminate program execution, since recovery from an error at this level in the computation is not possible.

It is always good practice to check for a nonzero value of info on return from a ScaLAPACK routine. Installers may consider modifying this routine in order to call system-specific exception-handling facilities.

Equality Test Functions

lsame
Tests two characters for equality regardless of the case.

Syntax
Fortran: val = lsame( ca, cb )
C: val = lsame( ca, cb );

Include Files
• FORTRAN 77: mkl_blas.fi
• C: mkl_blas.h

Input Parameters

Name      Type                     Description
ca, cb    FORTRAN: CHARACTER*1     FORTRAN: The single characters to be compared
          C: const char*           C: Pointers to the single characters to be compared

Output Parameters

Name    Type                 Description
val     FORTRAN: LOGICAL     Result of the comparison
        C: int

Description
This logical function returns .TRUE. if ca is the same letter as cb regardless of the case, and .FALSE. otherwise.

lsamen
Tests two character strings for equality regardless of the case.

Syntax
Fortran: val = lsamen( n, ca, cb )
C: val = lsamen( n, ca, cb );

Include Files
• FORTRAN 77: mkl_lapack.fi
• C: mkl_lapack.h

Input Parameters

Name      Type                       Description
n         FORTRAN: INTEGER           FORTRAN: The number of characters in ca and cb to be compared.
          C: const int*              C: Pointer to the number of characters in ca and cb to be compared.
ca, cb    FORTRAN: CHARACTER*(*)     Specify two character strings of length at least n to be compared. Only the first n characters of each string will be accessed.
          C: const char*

Output Parameters

Name    Type                 Description
val     FORTRAN: LOGICAL     FORTRAN: Result of the comparison. .TRUE. if ca and cb are equivalent except for the case, and .FALSE. otherwise. The function also returns .FALSE. if len(ca) or len(cb) is less than n.
        C: int               C: Result of the comparison. Non-zero if ca and cb are equivalent except for the case, and zero otherwise.

Description
This logical function tests if the first n letters of one string are the same as the first n letters of another string, regardless of the case.

Timing Functions

second/dsecnd
Returns elapsed CPU time in seconds.

Syntax
Fortran:
val = second()
val = dsecnd()
C:
val = second();
val = dsecnd();

Include Files
• FORTRAN 77: mkl_lapack.fi
• C: mkl_lapack.h

Output Parameters

Name    Type                                   Description
val     FORTRAN: REAL for second               Elapsed CPU time in seconds
                 DOUBLE PRECISION for dsecnd
        C: float for second
           double for dsecnd

Description
The second/dsecnd functions return the elapsed CPU time in seconds. The difference between these functions is that dsecnd returns the result with double precision.

Apply each function in pairs: call it once directly before the routine to be measured and once directly after it. The difference between the returned values is the time spent in the routine.

The second/dsecnd functions get the time from the elapsed CPU clocks divided by frequency. Obtaining the frequency may take some time when the second/dsecnd function runs for the first time. To eliminate the effect of this extra time on your measurements, make the first call to second/dsecnd in advance.

Do not use second for measuring short time intervals because the single-precision format is not capable of holding sufficient timer precision.

mkl_get_cpu_clocks
Returns full precision elapsed CPU clocks.

Syntax
Fortran: call mkl_get_cpu_clocks( clocks )
C: mkl_get_cpu_clocks( &clocks );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters

Name      Type                      Description
clocks    FORTRAN: INTEGER*8        Elapsed CPU clocks
          C: unsigned MKL_INT64

Description
The mkl_get_cpu_clocks function returns the elapsed CPU clocks. This may be useful when timing short intervals with high resolution. Like second/dsecnd, the mkl_get_cpu_clocks function is applied in pairs. Note that out-of-order code execution on IA-32 or Intel® 64 architecture processors may slightly disturb the exact elapsed CPU clocks value, which may matter when measuring extremely short time intervals.
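The relation "elapsed clocks divided by frequency" mentioned for second/dsecnd can be written out for a pair of clock readings. The helper below is a sketch only: demo_elapsed_seconds is a hypothetical name, and it assumes that the two readings come from a pair of mkl_get_cpu_clocks calls and that the frequency is a value in GHz such as the one returned by mkl_get_clocks_frequency. It uses plain C types so that it compiles without the Intel MKL headers.

```c
/* Converts a pair of clock readings to seconds.
   start_clocks, end_clocks: readings taken before and after the measured code.
   freq_ghz: clock frequency in GHz (1 GHz = 1.0e9 clocks per second).
   Returns -1.0 if the inputs cannot describe a valid interval. */
double demo_elapsed_seconds(unsigned long long start_clocks,
                            unsigned long long end_clocks,
                            double freq_ghz)
{
    if (freq_ghz <= 0.0 || end_clocks < start_clocks)
        return -1.0;
    return (double)(end_clocks - start_clocks) / (freq_ghz * 1.0e9);
}
```

For example, 3,000,000,000 elapsed clocks at 3.0 GHz correspond to one second.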
NOTE getcpuclocks is an obsolete name for the mkl_get_cpu_clocks function. It is still referenced in the library for backward compatibility but is deprecated and subject to removal in subsequent releases.

mkl_get_cpu_frequency
Returns the current CPU frequency value in GHz.

Syntax
Fortran: freq = mkl_get_cpu_frequency()
C: freq = mkl_get_cpu_frequency();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters

Name    Type                          Description
freq    FORTRAN: DOUBLE PRECISION     Current CPU frequency value in GHz
        C: double

Description
The function mkl_get_cpu_frequency returns the current CPU frequency in GHz.

NOTE getcpufrequency is an obsolete name for the mkl_get_cpu_frequency function. It is still referenced in the library for backward compatibility but is deprecated and subject to removal in subsequent releases.

mkl_get_max_cpu_frequency
Returns the maximum CPU frequency value in GHz.

Syntax
Fortran: freq = mkl_get_max_cpu_frequency()
C: freq = mkl_get_max_cpu_frequency();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters

Name    Type                          Description
freq    FORTRAN: DOUBLE PRECISION     Maximum CPU frequency value in GHz
        C: double

Description
The function mkl_get_max_cpu_frequency returns the maximum CPU frequency in GHz.

mkl_get_clocks_frequency
Returns the frequency value in GHz based on constant-rate Time Stamp Counter.

Syntax
Fortran: freq = mkl_get_clocks_frequency()
C: freq = mkl_get_clocks_frequency();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters

Name    Type                          Description
freq    FORTRAN: DOUBLE PRECISION     Frequency value in GHz
        C: double

Description
The function mkl_get_clocks_frequency returns the CPU frequency value (in GHz) based on the constant-rate Time Stamp Counter (TSC). Use of the constant-rate TSC ensures that each clock tick is constant even if the CPU frequency changes. Therefore, the returned frequency is constant.

NOTE Obtaining the frequency may take some time when mkl_get_clocks_frequency is called for the first time. The same holds for the functions second/dsecnd, which call mkl_get_clocks_frequency.

See Also
second/dsecnd

Memory Functions

This section describes the Intel MKL memory support functions. See the Intel® MKL User's Guide for details of the Intel MKL memory management.

mkl_free_buffers
Frees memory buffers.

Syntax
Fortran: call mkl_free_buffers
C: mkl_free_buffers();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
The mkl_free_buffers function frees the memory allocated by the Intel MKL memory management software. The memory management software allocates new buffers if no free buffers are currently available.
Call mkl_free_buffers() to free all memory buffers and to avoid memory leaks on completion of work with the Intel MKL functions, that is, after the last call of an Intel MKL function from your application.

See Intel® MKL User's Guide for details.

NOTE MKL_FreeBuffers is an obsolete name for the mkl_free_buffers function. It is still referenced in the library for backward compatibility but is deprecated and subject to removal in subsequent releases.

mkl_free_buffers Usage with FFT Functions
----------------------------------------------------------------------------------------------
DFTI_DESCRIPTOR_HANDLE hand1;
DFTI_DESCRIPTOR_HANDLE hand2;
void mkl_free_buffers(void);
. . . . . .
/* Using MKL FFT */
Status = DftiCreateDescriptor(&hand1, DFTI_SINGLE, DFTI_COMPLEX, dim, m1);
Status = DftiCommitDescriptor(hand1);
Status = DftiComputeForward(hand1, s_array1);
. . . . . .
Status = DftiCreateDescriptor(&hand2, DFTI_SINGLE, DFTI_COMPLEX, dim, m2);
Status = DftiCommitDescriptor(hand2);
. . . . . .
Status = DftiFreeDescriptor(&hand1);
/* Do not call mkl_free_buffers() here as the hand2 descriptor will be corrupted! */
. . . . . .
Status = DftiComputeBackward(hand2, s_array2);
Status = DftiFreeDescriptor(&hand2);
/* Here user finishes the MKL FFT usage */
/* Memory leak will be triggered by any memory control tool */
/* Use mkl_free_buffers() to avoid memory leaking */
mkl_free_buffers();
----------------------------------------------------------------------------------------------

If the memory space is sufficient, use mkl_free_buffers after the last call of the MKL functions. Otherwise, a drop in performance can occur due to reallocation of buffers for the subsequent MKL functions.

WARNING For FFT calls, do not use mkl_free_buffers between DftiCreateDescriptor(hand) and DftiFreeDescriptor(&hand).

mkl_thread_free_buffers
Frees memory buffers allocated in the current thread.

Syntax
Fortran: call mkl_thread_free_buffers
C: mkl_thread_free_buffers();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
The mkl_thread_free_buffers function frees the memory allocated by the Intel MKL memory management in the current thread only. Memory buffers allocated in other threads are not affected. Call mkl_thread_free_buffers() to avoid memory leaks when you cannot call mkl_free_buffers in a multi-threaded application because you are not sure whether all the other running Intel MKL functions have completed.

mkl_disable_fast_mm
Enables Intel MKL to dynamically turn off memory management.

Syntax
Fortran: mm = mkl_disable_fast_mm
C: mm = mkl_disable_fast_mm();

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Description
The Intel MKL memory management software is turned on by default. To turn it off dynamically before any Intel MKL function call, you can use the mkl_disable_fast_mm function similarly to the MKL_DISABLE_FAST_MM environment variable (see Intel® MKL User's Guide for details). Run the mkl_disable_fast_mm function to allocate and free memory from call to call. Note that disabling the Intel MKL memory management software negatively impacts performance of some Intel MKL routines, especially for small problem sizes.

The function return value 1 indicates that the Intel MKL memory management was turned off successfully. The function return value 0 indicates a failure.

mkl_mem_stat
Reports the amount of memory utilized by Intel MKL memory management software.

Syntax
Fortran: AllocatedBytes = mkl_mem_stat( AllocatedBuffers )
C: AllocatedBytes = mkl_mem_stat( &AllocatedBuffers );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Output Parameters

Name                Type                   Description
AllocatedBytes      FORTRAN: INTEGER*8     Amount of allocated bytes
                    C: MKL_INT64
AllocatedBuffers    FORTRAN: INTEGER*4     Number of allocated buffers
                    C: int

Description
The function returns the amount of the allocated memory in the AllocatedBuffers buffers. If there are no allocated buffers at the moment, the function returns 0. Call the mkl_mem_stat() function to check the Intel MKL memory status.

Note that after calling mkl_free_buffers there should not be any allocated buffers.

See Example "mkl_malloc(), mkl_free(), mkl_mem_stat() Usage".

NOTE MKL_MemStat is an obsolete name for the mkl_mem_stat function. It is still referenced in the library for backward compatibility but is deprecated and subject to removal in subsequent releases.

mkl_malloc
Allocates the aligned memory buffer.

Syntax
Fortran: a_ptr = mkl_malloc( alloc_size, alignment )
C: a_ptr = mkl_malloc( alloc_size, alignment );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

Name          Type                   Description
alloc_size    FORTRAN: INTEGER*4     Size of the buffer to be allocated
              C: size_t              Note that Fortran type INTEGER*4 is given for the 32-bit systems. Otherwise, it is INTEGER*8.
alignment     FORTRAN: INTEGER*4     Alignment of the allocated buffer
              C: int

Output Parameters

Name     Type                 Description
a_ptr    FORTRAN: POINTER     Pointer to the allocated buffer
         C: void*

Description
The function allocates a buffer of alloc_size bytes, aligned on the alignment boundary, and returns a pointer to this buffer.

The function returns NULL if alloc_size < 1. If alignment is not a power of 2, the alignment 32 is used.

See Example "mkl_malloc(), mkl_free(), mkl_mem_stat() Usage".
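The alignment rule stated above (a value that is not a power of 2 is replaced by 32) can be checked with a few lines of stand-alone C. This is a sketch of the documented rule, not the Intel MKL implementation; demo_effective_alignment and demo_is_aligned are hypothetical helper names, and treating zero or negative values as "not a power of 2" is an assumption.

```c
#include <stdint.h>

/* Returns the alignment that would be applied under the documented rule:
   the requested value if it is a power of 2, otherwise 32. */
int demo_effective_alignment(int alignment)
{
    /* A positive power of 2 has exactly one bit set. */
    if (alignment > 0 && (alignment & (alignment - 1)) == 0)
        return alignment;
    return 32;
}

/* Returns non-zero if pointer p is a multiple of alignment.
   alignment must be positive. */
int demo_is_aligned(const void* p, int alignment)
{
    return ((uintptr_t)p % (uintptr_t)alignment) == 0;
}
```

A caller could apply demo_is_aligned to a pointer returned by mkl_malloc, using demo_effective_alignment of the requested value, to confirm that a buffer meets the alignment it expects.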
mkl_free
Frees the aligned memory buffer allocated by mkl_malloc.

Syntax
Fortran: call mkl_free( a_ptr )
C: mkl_free( a_ptr );

Include Files
• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

Name     Type                 Description
a_ptr    FORTRAN: POINTER     Pointer to the buffer to be freed
         C: void*

Description
The function frees the buffer pointed to by a_ptr and allocated by mkl_malloc().

See Example "mkl_malloc(), mkl_free(), mkl_mem_stat() Usage".

Examples of mkl_malloc(), mkl_free(), mkl_mem_stat() Usage

Usage Example in Fortran

      PROGRAM FOO
      REAL*8 A,B,C
      POINTER (A_PTR,A(1)), (B_PTR,B(1)), (C_PTR,C(1))
      INTEGER N, I
      REAL*8 ALPHA, BETA
      INTEGER*8 ALLOCATED_BYTES
      INTEGER*4 ALLOCATED_BUFFERS
#ifdef _SYSTEM_BITS32
      INTEGER*4 MKL_MALLOC
      INTEGER*4 ALLOC_SIZE
#else
      INTEGER*8 MKL_MALLOC
      INTEGER*8 ALLOC_SIZE
#endif
C     MKL_MEM_STAT returns the INTEGER*8 byte count documented above
      INTEGER*8 MKL_MEM_STAT
      EXTERNAL MKL_MALLOC, MKL_FREE, MKL_MEM_STAT

      ALPHA = 1.1; BETA = -1.2
      N = 1000
      ALLOC_SIZE = 8*N*N
      A_PTR = MKL_MALLOC(ALLOC_SIZE,64)
      B_PTR = MKL_MALLOC(ALLOC_SIZE,64)
      C_PTR = MKL_MALLOC(ALLOC_SIZE,64)
      DO I=1,N*N
         A(I) = I
         B(I) = -I
         C(I) = 0.0
      END DO

      CALL DGEMM('N','N',N,N,N,ALPHA,A,N,B,N,BETA,C,N)

      ALLOCATED_BYTES = MKL_MEM_STAT(ALLOCATED_BUFFERS)
      PRINT *,'DGEMM uses ',ALLOCATED_BYTES,' bytes in ',
     $ ALLOCATED_BUFFERS,' buffers '

      CALL MKL_FREE_BUFFERS

      ALLOCATED_BYTES = MKL_MEM_STAT(ALLOCATED_BUFFERS)
      IF (ALLOCATED_BYTES > 0) THEN
         PRINT *,'MKL MEMORY LEAK!'
         PRINT *,'AFTER MKL_FREE_BUFFERS there are ',
     $ ALLOCATED_BYTES,' bytes in ',
     $ ALLOCATED_BUFFERS,' buffers'
      END IF

      CALL MKL_FREE(A_PTR)
      CALL MKL_FREE(B_PTR)
      CALL MKL_FREE(C_PTR)

      STOP
      END

Usage Example in C

#include <stdio.h>
#include "mkl.h"

int main(void) {
    double *a, *b, *c;
    int n, i;
    double alpha, beta;
    MKL_INT64 AllocatedBytes;
    int N_AllocatedBuffers;

    alpha = 1.1; beta = -1.2;
    n = 1000;
    a = (double*)mkl_malloc(n*n*sizeof(double),64);
    b = (double*)mkl_malloc(n*n*sizeof(double),64);
    c = (double*)mkl_malloc(n*n*sizeof(double),64);
    for (i=0;i<(n*n);i++) {
        a[i] = (double)(i+1);
        b[i] = (double)(-i-1);
        c[i] = 0.0;
    }

    dgemm("N","N",&n,&n,&n,&alpha,a,&n,b,&n,&beta,c,&n);

    AllocatedBytes = mkl_mem_stat(&N_AllocatedBuffers);
    printf("\nDGEMM uses %ld bytes in %d buffers",(long)AllocatedBytes,N_AllocatedBuffers);

    mkl_free_buffers();

    AllocatedBytes = mkl_mem_stat(&N_AllocatedBuffers);
    if (AllocatedBytes > 0) {
        printf("\nMKL memory leak!");
        printf("\nAfter mkl_free_buffers there are %ld bytes in %d buffers",
               (long)AllocatedBytes,N_AllocatedBuffers);
    }

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    return 0;
}

Miscellaneous Utility Functions

mkl_progress
Provides progress information.
Syntax Fortran: stopflag = mkl_progress( thread, step, stage ) C: stopflag = mkl_progress( thread, step, stage, lstage ); Include Files • FORTRAN 77: mkl_service.fi • C: mkl_lapack.h and mkl_service.h Input Parameters Name Type Description thread FORTRAN: INTEGER*4 C: const int* FORTRAN: The number of the thread the progress routine is called from. 0 is passed for sequential code. 15 Intel® Math Kernel Library Reference Manual 2542 Name Type Description C: Pointer to the number of the thread the progress routine is called from. 0 is passed for sequential code. step FORTRAN: INTEGER*4 C: const int* FORTRAN: The linear progress indicator that shows the amount of work done. Increases from 0 to the linear size of the problem during the computation. C: Pointer to the linear progress indicator that shows the amount of work done. Increases from 0 to the linear size of the problem during the computation. stage FORTRAN: CHARACTER*(*) C: const char* Message indicating the name of the routine or the name of the computation stage the progress routine is called from. lstage C: int The length of a stage string excluding the trailing NULL character. Output Parameters Name Type Description stopflag FORTRAN: INTEGER C: int The stopping flag. A non-zero flag forces the routine to be interrupted. The zero flag is the default return value. Description The mkl_progress function is intended to track progress of a lengthy computation and/or interrupt the computation. By default this routine does nothing but the user application can redefine it to obtain the computation progress information. You can set it to perform certain operations during the routine computation, for instance, to print a progress indicator. A non-zero return value may be supplied by the redefined function to break the computation. The progress function mkl_progress is regularly called from some LAPACK and DSS/PARDISO functions during the computation. 
Refer to a specific LAPACK or DSS/PARDISO function description to see whether the function supports this feature or not.

Application Notes

Note that mkl_progress is a Fortran routine. To redefine the progress routine from C, the name must be spelled differently, parameters must be passed by reference, and an extra parameter giving the length of the stage string must be taken into account. The stage string is not terminated with the NULL character. The C interface of the progress routine is as follows:

int mkl_progress_( int* thread, int* step, char* stage, int lstage ); // Linux, Mac
int MKL_PROGRESS( int* thread, int* step, char* stage, int lstage ); // Windows

The following examples print progress information to the standard output in Fortran and C:

Examples

Fortran example:

      integer function mkl_progress( thread, step, stage )
      integer*4 thread, step
      character*(*) stage
      print*,'Thread:',thread,',stage:',stage,',step:',step
      mkl_progress = 0
      return
      end

C example:

#include <stdio.h>
#include <string.h>

#define BUFLEN 16

int mkl_progress_( int* ithr, int* step, char* stage, int lstage )
{
    char buf[BUFLEN];
    /* The stage string is not NULL-terminated: copy at most BUFLEN-1
       characters and terminate the copy explicitly. */
    if( lstage >= BUFLEN ) lstage = BUFLEN-1;
    strncpy( buf, stage, lstage );
    buf[lstage] = '\0';
    printf( "In thread %i, at stage %s, steps passed %i\n", *ithr, buf, *step );
    return 0;
}

mkl_enable_instructions

Allows dispatching Intel® Advanced Vector Extensions.

Syntax

Fortran:
irc = mkl_enable_instructions(MKL_AVX_ENABLE)

C:
irc = mkl_enable_instructions(MKL_AVX_ENABLE);

Include Files

• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

MKL_AVX_ENABLE
    Parameter indicating which new instructions the user needs to enable.

Output Parameters

irc (FORTRAN: INTEGER*4; C: int)
    Value reflecting AVX usage status:
    =1 MKL uses the AVX code, if the hardware supports Intel® AVX.
    =0 The request is rejected. Most likely, mkl_enable_instructions has been called after another Intel MKL function.
Description

This function is currently void and deprecated but can be used in future Intel MKL releases.

NOTE Always remember to add #include "mkl.h" to use the C usage syntax.

Functions Supporting the Single Dynamic Library

Intel® MKL provides the Single Dynamic Library (SDL), which enables setting the interface and threading layer for Intel MKL at run time. See Intel® MKL User's Guide for details of SDL and layered model concept. This section describes the functions supporting SDL.

mkl_set_interface_layer

Sets the interface layer for Intel MKL at run time. Use with the Single Dynamic Library.

Syntax

Fortran:
interface = mkl_set_interface_layer( required_interface )

C:
interface = mkl_set_interface_layer( required_interface );

Include Files

• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

required_interface (FORTRAN: INTEGER; C: int)
    Determines the interface layer. Possible values:
    MKL_INTERFACE_LP64 for the LP64 interface.
    MKL_INTERFACE_ILP64 for the ILP64 interface.
Description

If you are using the Single Dynamic Library (SDL), the mkl_set_interface_layer function sets LP64 or ILP64 interface for Intel MKL at run time. Call this function prior to calling any other Intel MKL function in your application except mkl_set_threading_layer. You can call mkl_set_interface_layer and mkl_set_threading_layer in any order.

The mkl_set_interface_layer function takes precedence over the MKL_INTERFACE_LAYER environment variable.

See Intel MKL User's Guide for the layered model concept and usage details of SDL.

mkl_set_threading_layer

Sets the threading layer for Intel MKL at run time. Use with the Single Dynamic Library (SDL).

Syntax

Fortran:
threading = mkl_set_threading_layer( required_threading )

C:
threading = mkl_set_threading_layer( required_threading );

Include Files

• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

required_threading (FORTRAN: INTEGER; C: int)
    Determines the threading layer. Possible values:
    MKL_THREADING_INTEL for Intel threading.
    MKL_THREADING_SEQUENTIAL for the sequential mode of Intel MKL.
    MKL_THREADING_PGI for PGI threading on Windows* or Linux* operating system only.
    MKL_THREADING_GNU for GNU threading on Linux* operating system only.

Description

If you are using the Single Dynamic Library (SDL), the mkl_set_threading_layer function sets the specified threading layer for Intel MKL at run time. Call this function prior to calling any other Intel MKL function in your application except mkl_set_interface_layer. You can call mkl_set_threading_layer and mkl_set_interface_layer in any order.

The mkl_set_threading_layer function takes precedence over the MKL_THREADING_LAYER environment variable.

See Intel MKL User's Guide for the layered model concept and usage details of SDL.

mkl_set_xerbla

Replaces the error handling routine. Use with the Single Dynamic Library on Windows* OS.
Syntax

Fortran:
old_xerbla_ptr = mkl_set_xerbla( new_xerbla_ptr )

C:
old_xerbla_ptr = mkl_set_xerbla( new_xerbla_ptr );

Include Files

• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

new_xerbla_ptr (XerblaEntry)
    Pointer to the error handling routine to be used.

Description

If you are linking with the Single Dynamic Library (SDL) mkl_rt.lib on Windows* OS, the mkl_set_xerbla function replaces the error handling routine that is called by Intel MKL functions with the routine specified by the parameter.

See Intel MKL User's Guide for details of SDL.

Return Values

The function returns the pointer to the replaced error handling routine.

See Also
xerbla

mkl_set_progress

Replaces the progress information routine. Use with the Single Dynamic Library (SDL) on Windows* OS.

Syntax

Fortran:
old_progress_ptr = mkl_set_progress( new_progress_ptr )

C:
old_progress_ptr = mkl_set_progress( new_progress_ptr );

Include Files

• FORTRAN 77: mkl_service.fi
• C: mkl_service.h

Input Parameters

new_progress_ptr (ProgressEntry)
    Pointer to the progress information routine to be used.

Description

If you are linking with the Single Dynamic Library (SDL) mkl_rt.lib on Windows* OS, the mkl_set_progress function replaces the currently used progress information routine with the routine specified by the parameter.

See Intel MKL User's Guide for details of SDL.

Return Values

The function returns the pointer to the replaced progress information routine.

See Also
mkl_progress

BLACS Routines

This chapter describes the Intel® Math Kernel Library implementation of FORTRAN 77 routines from the BLACS (Basic Linear Algebra Communication Subprograms) package.
These routines are used to support a linear algebra oriented message passing interface that may be implemented efficiently and uniformly across a large range of distributed memory platforms.

The BLACS routines make linear algebra applications both easier to program and more portable. For this purpose, Intel MKL for the Linux* and Windows* operating systems uses them as the communication layer of ScaLAPACK and Cluster FFT.

On computers, a linear algebra matrix is represented by a two-dimensional array (2D array), and therefore the BLACS operate on 2D arrays. See Matrix Shapes for a description of the basic matrix shapes.

The BLACS routines implemented in Intel MKL fall into four categories:

• Combines
• Point to Point Communication
• Broadcast
• Support

The Combines take data distributed over processes and combine the data to produce a result. The Point to Point routines are intended for point-to-point communication, and the Broadcast routines send data possessed by one process to all processes within a scope. The Support routines perform distinct tasks that can be used for initialization, destruction, information, and miscellaneous tasks.

Matrix Shapes

The BLACS routines recognize the two most common classes of matrices for dense linear algebra. The first of these classes consists of general rectangular matrices, which in machine storage are 2D arrays consisting of m rows and n columns, with a leading dimension, lda, that determines the distance between successive columns in memory.

The general rectangular matrices take the following parameters as input when determining what array to operate on:

m      (input) INTEGER. The number of matrix rows to be operated on.
n      (input) INTEGER. The number of matrix columns to be operated on.
a      (input/output) TYPE (depends on routine), array of dimension (lda,n). A pointer to the beginning of the (sub)array to be sent.
lda    (input) INTEGER. The distance between two successive elements in a matrix row.
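To make the role of m, n, a, and lda concrete, here is a minimal sketch in plain C of how an m-by-n submatrix is addressed inside a larger column-major array. It does not call any BLACS routine; the helper names colmajor_index and pack_submatrix are hypothetical and exist only for this illustration, and indices are 0-based as is usual in C.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j), 0-based, in a column-major array whose
   columns are lda elements apart. Hypothetical helper, not a BLACS call. */
size_t colmajor_index(int i, int j, int lda)
{
    assert(i >= 0 && j >= 0 && lda > i);
    return (size_t)i + (size_t)j * (size_t)lda;
}

/* Gather the m-by-n submatrix that starts at a (leading dimension lda)
   into the contiguous buffer out, which must hold m*n elements. This is
   roughly the packing step that a send of a general rectangular matrix
   hides from the caller. */
void pack_submatrix(int m, int n, const double *a, int lda, double *out)
{
    assert(m >= 0 && n >= 0 && lda >= m);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            out[i + j * m] = a[colmajor_index(i, j, lda)];
}
```

For example, with a 4-by-3 array stored with lda = 4, passing a + colmajor_index(1, 1, 4) together with m = 2, n = 2, and lda = 4 selects the 2-by-2 block whose top-left element is at row 1, column 1. The pointer and lda alone are enough to locate every element of the block.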
The second class of matrices recognized by the BLACS are trapezoidal matrices (triangular matrices are a sub-class of trapezoidal). Trapezoidal arrays are defined by m, n, and lda, as above, but they have two additional parameters as well. These parameters are:

uplo    (input) CHARACTER*1. Indicates whether the matrix is upper or lower trapezoidal, as discussed below.
diag    (input) CHARACTER*1. Indicates whether the diagonal of the matrix is unit diagonal (will not be operated on) or otherwise (will be operated on).

The shape of the trapezoidal arrays is determined by these parameters as follows:

[Figure: Trapezoidal Arrays Shapes]

The packing of arrays, if required, so that they may be sent efficiently is hidden, allowing the user to concentrate on the logical matrix, rather than on how the data is organized in the system memory.

BLACS Combine Operations

This section describes BLACS routines that combine the data to produce a result.

In a combine operation, each participating process contributes data that is combined with other processes' data to produce a result. This result can be given to a particular process (called the destination process), or to all participating processes. If the result is given to only one process, the operation is referred to as a leave-on-one combine, and if the result is given to all participating processes, the operation is referred to as a leave-on-all combine.

At present, three kinds of combines of general rectangular arrays are supported:

• element-wise summation
• element-wise absolute value maximization
• element-wise absolute value minimization

Note that a combine operation combines data between processes. By definition, a combine performed across a scope of only one process does not change the input data. This is why the operations (max/min/sum) are specified as element-wise.
Element-wise indicates that each element of the input array will be combined with the corresponding element from all other processes' arrays to produce the result. Thus, a 4 x 2 array of inputs produces a 4 x 2 answer array.

When the max/min comparison is being performed, absolute value is used. For example, -5 and 5 are equivalent. However, the returned value is unchanged; that is, it is not the absolute value, but is a signed value instead. Therefore, if you performed a BLACS absolute value maximum combine on the numbers -5, 3, 1, -8 the result would be -8.

The initial symbol ? in the routine names below masks the data type:

i    integer
s    single precision real
d    double precision real
c    single precision complex
z    double precision complex

BLACS Combines

Routine name    Results of operation
?gamx2d         Entries of result matrix will have the value of the greatest absolute value found in that position.
?gamn2d         Entries of result matrix will have the value of the smallest absolute value found in that position.
?gsum2d         Entries of result matrix will have the summation of that position.

?gamx2d

Performs element-wise absolute value maximization.

Syntax

call igamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call sgamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call dgamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call cgamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call zgamx2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )

Input Parameters

icontxt    INTEGER. Integer handle that indicates the context.
scope      CHARACTER*1. Indicates what scope the combine should proceed on. Limited to ROW, COLUMN, or ALL.
top        CHARACTER*1. Communication pattern to use during the combine operation.
m          INTEGER. The number of matrix rows to be combined.
n          INTEGER. The number of matrix columns to be combined.
a          TYPE array (lda, n).
Matrix to be compared with to produce the maximum.
lda        INTEGER. The leading dimension of the matrix A, that is, the distance between two successive elements in a matrix row.
rcflag     INTEGER. If rcflag = -1, the arrays ra and ca are not referenced and need not exist. Otherwise, rcflag indicates the leading dimension of these arrays, and so must be ≥ m.
rdest      INTEGER. The process row coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.
cdest      INTEGER. The process column coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.

Output Parameters

a          TYPE array (lda, n). Contains the result if this process is selected to receive the answer, or intermediate results if the process is not selected to receive the result.
ra         INTEGER array (rcflag, n). If rcflag = -1, this array will not be referenced, and need not exist. Otherwise, it is an integer array (of size at least rcflag x n) indicating the row index of the process that provided the maximum. If the calling process is not selected to receive the result, this array will contain intermediate (useless) results.
ca         INTEGER array (rcflag, n). If rcflag = -1, this array will not be referenced, and need not exist. Otherwise, it is an integer array (of size at least rcflag x n) indicating the column index of the process that provided the maximum. If the calling process is not selected to receive the result, this array will contain intermediate (useless) results.

Description

This routine performs element-wise absolute value maximization, that is, each element of matrix A is compared with the corresponding element of the other processes' matrices. Note that the value of A is returned, but the absolute value is used to determine the maximum (the 1-norm is used for complex numbers).
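The element-wise rule can be sketched in plain C as a single-process simulation. This is not the BLACS implementation and performs no communication: the function amax_combine is hypothetical, the contributions of all simulated processes are assumed to sit back to back in one array, and the owner output records a flat process number rather than the row and column coordinates that ra and ca hold. Ties are resolved in favor of the lowest process number, which is an arbitrary choice made for this sketch.

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double abs_val(double x) { return x < 0.0 ? -x : x; }

/* Simulated leave-on-all absolute-value maximum combine.
   all    : nproc matrices of lda*n elements each, stored back to back,
            each column-major with leading dimension lda
   result : lda*n elements; receives the signed winning values
   owner  : lda*n elements; receives the number of the simulated process
            that supplied each winning value */
void amax_combine(int nproc, int m, int n, const double *all, int lda,
                  double *result, int *owner)
{
    assert(nproc > 0 && lda >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            int best = 0;
            double val = all[i + j * lda];
            for (int p = 1; p < nproc; ++p) {
                double cand = all[p * lda * n + i + j * lda];
                /* Compare magnitudes, but keep the signed value. */
                if (abs_val(cand) > abs_val(val)) {
                    val = cand;
                    best = p;
                }
            }
            result[i + j * lda] = val;
            owner[i + j * lda] = best;
        }
    }
}
```

With three simulated processes contributing the 2-by-1 arrays (2, -1), (-7, 3), and (4, -3), the combined array is (-7, 3): the first entry keeps its negative sign because only the comparison uses the absolute value.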
Combines may be globally-blocking, so they must be programmed as if no process returns until all have called the routine.

See Also
BLACS Routines Usage Example

?gamn2d

Performs element-wise absolute value minimization.

Syntax

call igamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call sgamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call dgamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call cgamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )
call zgamn2d( icontxt, scope, top, m, n, a, lda, ra, ca, rcflag, rdest, cdest )

Input Parameters

icontxt    INTEGER. Integer handle that indicates the context.
scope      CHARACTER*1. Indicates what scope the combine should proceed on. Limited to ROW, COLUMN, or ALL.
top        CHARACTER*1. Communication pattern to use during the combine operation.
m          INTEGER. The number of matrix rows to be combined.
n          INTEGER. The number of matrix columns to be combined.
a          TYPE array (lda, n). Matrix to be compared with to produce the minimum.
lda        INTEGER. The leading dimension of the matrix A, that is, the distance between two successive elements in a matrix row.
rcflag     INTEGER. If rcflag = -1, the arrays ra and ca are not referenced and need not exist. Otherwise, rcflag indicates the leading dimension of these arrays, and so must be ≥ m.
rdest      INTEGER. The process row coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.
cdest      INTEGER. The process column coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.

Output Parameters

a          TYPE array (lda, n). Contains the result if this process is selected to receive the answer, or intermediate results if the process is not selected to receive the result.
ra         INTEGER array (rcflag, n).
If rcflag = -1, this array will not be referenced, and need not exist. Otherwise, it is an integer array (of size at least rcflag x n) indicating the row index of the process that provided the minimum. If the calling process is not selected to receive the result, this array will contain intermediate (useless) results.
ca         INTEGER array (rcflag, n). If rcflag = -1, this array will not be referenced, and need not exist. Otherwise, it is an integer array (of size at least rcflag x n) indicating the column index of the process that provided the minimum. If the calling process is not selected to receive the result, this array will contain intermediate (useless) results.

Description

This routine performs element-wise absolute value minimization, that is, each element of matrix A is compared with the corresponding element of the other processes' matrices. Note that the value of A is returned, but the absolute value is used to determine the minimum (the 1-norm is used for complex numbers).

Combines may be globally-blocking, so they must be programmed as if no process returns until all have called the routine.

See Also
BLACS Routines Usage Example

?gsum2d

Performs element-wise summation.

Syntax

call igsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )
call sgsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )
call dgsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )
call cgsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )
call zgsum2d( icontxt, scope, top, m, n, a, lda, rdest, cdest )

Input Parameters

icontxt    INTEGER. Integer handle that indicates the context.
scope      CHARACTER*1. Indicates what scope the combine should proceed on. Limited to ROW, COLUMN, or ALL.
top        CHARACTER*1. Communication pattern to use during the combine operation.
m          INTEGER. The number of matrix rows to be combined.
n          INTEGER. The number of matrix columns to be combined.
a          TYPE array (lda, n). Matrix to be added to produce the sum.
lda        INTEGER.
The leading dimension of the matrix A, that is, the distance between two successive elements in a matrix row.
rdest      INTEGER. The process row coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.
cdest      INTEGER. The process column coordinate of the process that should receive the result. If rdest or cdest = -1, all processes within the indicated scope receive the answer.

Output Parameters

a          TYPE array (lda, n). Contains the result if this process is selected to receive the answer, or intermediate results if the process is not selected to receive the result.

Description

This routine performs element-wise summation, that is, each element of matrix A is summed with the corresponding element of the other processes' matrices. Combines may be globally-blocking, so they must be programmed as if no process returns until all have called the routine.

See Also
BLACS Routines Usage Example

BLACS Point To Point Communication

This section describes BLACS routines for point to point communication.

Point to point communication requires two complementary operations. The send operation produces a message that is then consumed by the receive operation. These operations have various resources associated with them. The main such resource is the buffer that holds the data to be sent or serves as the area where the incoming data is to be received. The level of blocking indicates what correlation the return from a send/receive operation has with the availability of these resources and with the status of the message.

Non-blocking

The return from the send or receive operations does not imply that the resources may be reused, that the message has been sent/received, or that the complementary operation has been called. Return means only that the send/receive has been started, and will be completed at some later time. Polling is required to determine when the operation has finished.
In non-blocking message passing, the concept of communication/computation overlap (abbreviated C/C overlap) is important. If a system possesses C/C overlap, independent computation can occur at the same time as communication. That means a non-blocking operation can be posted, and unrelated work can be done while the message is sent/received in parallel. If C/C overlap is not present, after returning from the routine call, computation will be interrupted at some later time when the message is actually sent or received.

Locally-blocking

Return from the send or receive operations indicates that the resources may be reused. However, since this only depends on local information, it is unknown whether the complementary operation has been called. There are no locally-blocking receives: the send must be completed before the receive buffer is available for re-use.

If a receive has not been posted at the time a locally-blocking send is issued, buffering will be required to avoid losing the message. Buffering can be done on the sending process, the receiving process, or not done at all, losing the message.

Globally-blocking

Return from a globally-blocking procedure indicates that the operation resources may be reused, and that the complement of the operation has at least been posted. Since the receive has been posted, there is no buffering required for globally-blocking sends: the message is always sent directly into the user's receive buffer.

Almost all processors support non-blocking communication, as well as some other level of blocking sends. What level of blocking the send possesses varies between platforms. For instance, the Intel® processors support locally-blocking sends, with buffering done on the receiving process. This is a very important distinction, because codes written assuming locally-blocking sends will hang on platforms with globally-blocking sends.
Below is a simple example of how this can occur:

      IAM = MY_PROCESS_ID()
      IF (IAM .EQ. 0) THEN
         SEND TO PROCESS 1
         RECV FROM PROCESS 1
      ELSE IF (IAM .EQ. 1) THEN
         SEND TO PROCESS 0
         RECV FROM PROCESS 0
      END IF

If the send is globally-blocking, process 0 enters the send, and waits for process 1 to start its receive before continuing. In the meantime, process 1 starts to send to 0, and waits for 0 to receive before continuing. Both processes are now waiting on each other, and the program will never continue.

The solution for this case is obvious. One of the processes simply reverses the order of its communication calls and the hang is avoided. However, when the communication is not just between two processes, but rather involves a hierarchy of processes, determining how to avoid this kind of difficulty can become problematic.

For this reason, it was decided the BLACS would support locally-blocking sends. On systems natively supporting globally-blocking sends, non-blocking sends coupled with buffering are used to simulate locally-blocking sends. The BLACS support globally-blocking receives.

In addition, the BLACS specify that point to point messages between two given processes will be strictly ordered. If process 0 sends three messages (label them A, B, and C) to process 1, process 1 must receive A before it can receive B, and message C can be received only after both A and B. The main reason for this restriction is that it allows for the computation of message identifiers.

Note, however, that messages from different processes are not ordered. If processes 0, . . ., 3 send messages A, . . ., D to process 4, process 4 may receive these messages in any order that is convenient.

Convention

The convention used in the communication routine names follows the template ?xxyy2d, where the letter in the ?
position indicates the data type being sent, xx is replaced to indicate the shape of the matrix, and the yy positions are used to indicate the type of communication to perform:

i     integer
s     single precision real
d     double precision real
c     single precision complex
z     double precision complex
ge    The data to be communicated is stored in a general rectangular matrix.
tr    The data to be communicated is stored in a trapezoidal matrix.
sd    Send. One process sends to another.
rv    Receive. One process receives from another.

BLACS Point To Point Communication

Routine name        Operation performed
?gesd2d, ?trsd2d    Take the indicated matrix and send it to the destination process.
?gerv2d, ?trrv2d    Receive a message from the process into the matrix.

As a simple example, the pseudo code given above is rewritten below in terms of the BLACS. It is further specified that the data being exchanged is the double precision vector X, which is 5 elements long.

C     ICONTXT is the context handle of the process grid.
      CALL BLACS_GRIDINFO(ICONTXT, NPROW, NPCOL, MYPROW, MYPCOL)
      IF (MYPROW.EQ.0 .AND. MYPCOL.EQ.0) THEN
         CALL DGESD2D(ICONTXT, 5, 1, X, 5, 1, 0)
         CALL DGERV2D(ICONTXT, 5, 1, X, 5, 1, 0)
      ELSE IF (MYPROW.EQ.1 .AND. MYPCOL.EQ.0) THEN
         CALL DGESD2D(ICONTXT, 5, 1, X, 5, 0, 0)
         CALL DGERV2D(ICONTXT, 5, 1, X, 5, 0, 0)
      END IF

?gesd2d

Takes a general rectangular matrix and sends it to the destination process.

Syntax

call igesd2d( icontxt, m, n, a, lda, rdest, cdest )
call sgesd2d( icontxt, m, n, a, lda, rdest, cdest )
call dgesd2d( icontxt, m, n, a, lda, rdest, cdest )
call cgesd2d( icontxt, m, n, a, lda, rdest, cdest )
call zgesd2d( icontxt, m, n, a, lda, rdest, cdest )

Input Parameters

icontxt        INTEGER. Integer handle that indicates the context.
m, n, a, lda   Describe the matrix to be sent. See Matrix Shapes for details.
rdest          INTEGER. The process row coordinate of the process to send the message to.
cdest          INTEGER. The process column coordinate of the process to send the message to.
Description

This routine takes the indicated general rectangular matrix and sends it to the destination process located at {RDEST, CDEST} in the process grid. Return from the routine indicates that the buffer (the matrix A) may be reused. The routine is locally-blocking, that is, it will return even if the corresponding receive is not posted.

See Also
BLACS Routines Usage Example

?trsd2d

Takes a trapezoidal matrix and sends it to the destination process.

Syntax

call itrsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )
call strsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )
call dtrsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )
call ctrsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )
call ztrsd2d( icontxt, uplo, diag, m, n, a, lda, rdest, cdest )

Input Parameters

icontxt                    INTEGER. Integer handle that indicates the context.
uplo, diag, m, n, a, lda   Describe the matrix to be sent. See Matrix Shapes for details.
rdest                      INTEGER. The process row coordinate of the process to send the message to.
cdest                      INTEGER. The process column coordinate of the process to send the message to.

Description

This routine takes the indicated trapezoidal matrix and sends it to the destination process located at {RDEST, CDEST} in the process grid. Return from the routine indicates that the buffer (the matrix A) may be reused. The routine is locally-blocking, that is, it will return even if the corresponding receive is not posted.

?gerv2d

Receives a message from the process into the general rectangular matrix.

Syntax

call igerv2d( icontxt, m, n, a, lda, rsrc, csrc )
call sgerv2d( icontxt, m, n, a, lda, rsrc, csrc )
call dgerv2d( icontxt, m, n, a, lda, rsrc, csrc )
call cgerv2d( icontxt, m, n, a, lda, rsrc, csrc )
call zgerv2d( icontxt, m, n, a, lda, rsrc, csrc )

Input Parameters

icontxt      INTEGER. Integer handle that indicates the context.
m, n, lda    Describe the matrix to be received. See Matrix Shapes for details.
rsrc         INTEGER.
The process row coordinate of the source of the message.
csrc         INTEGER. The process column coordinate of the source of the message.

Output Parameters

a            An array of dimension (lda,n) to receive the incoming message into.

Description

This routine receives a message from process {RSRC, CSRC} into the general rectangular matrix A. This routine is globally-blocking, that is, return from the routine indicates that the message has been received into A.

See Also
BLACS Routines Usage Example

?trrv2d

Receives a message from the process into the trapezoidal matrix.

Syntax

call itrrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )
call strrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )
call dtrrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )
call ctrrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )
call ztrrv2d( icontxt, uplo, diag, m, n, a, lda, rsrc, csrc )

Input Parameters

icontxt                  INTEGER. Integer handle that indicates the context.
uplo, diag, m, n, lda    Describe the matrix to be received. See Matrix Shapes for details.
rsrc                     INTEGER. The process row coordinate of the source of the message.
csrc                     INTEGER. The process column coordinate of the source of the message.

Output Parameters

a                        An array of dimension (lda,n) to receive the incoming message into.

Description

This routine receives a message from process {RSRC, CSRC} into the trapezoidal matrix A. This routine is globally-blocking, that is, return from the routine indicates that the message has been received into A.

BLACS Broadcast Routines

This section describes BLACS broadcast routines.

A broadcast sends data possessed by one process to all processes within a scope. Broadcast, much like point to point communication, has two complementary operations. The process that owns the data to be broadcast issues a broadcast/send. All processes within the same scope must then issue the complementary broadcast/receive.
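The notion of a scope can be pictured with a small plain-C simulation of a process grid. This sketch only models which processes end up holding the data; it performs no communication and does not use any BLACS routine. The function names in_scope and simulate_broadcast, the row-major layout of the grid array, and the single value held per process are all assumptions made for the illustration.

```c
#include <assert.h>

/* Returns 1 if process (myrow, mycol) belongs to the scope of a broadcast
   issued by process (rsrc, csrc). Only the first letter of the scope name
   is examined: 'R' (row), 'C' (column), or 'A' (all). */
int in_scope(char scope, int myrow, int mycol, int rsrc, int csrc)
{
    switch (scope) {
    case 'R': case 'r': return myrow == rsrc;
    case 'C': case 'c': return mycol == csrc;
    case 'A': case 'a': return 1;
    default:            return 0;
    }
}

/* grid holds one value per simulated process, stored row by row:
   process (r, c) owns grid[r * npcol + c]. After the call, every process
   within the scope holds the value owned by the source process. */
void simulate_broadcast(char scope, int nprow, int npcol,
                        int rsrc, int csrc, double *grid)
{
    assert(rsrc >= 0 && rsrc < nprow && csrc >= 0 && csrc < npcol);
    double value = grid[rsrc * npcol + csrc];
    for (int r = 0; r < nprow; ++r)
        for (int c = 0; c < npcol; ++c)
            if (in_scope(scope, r, c, rsrc, csrc))
                grid[r * npcol + c] = value;
}
```

On a 2 x 3 grid, a row-scope broadcast from process (1, 2) changes only the three processes in process row 1, while a broadcast with scope 'A' from the same source would overwrite the value held by all six processes.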
The BLACS define that both broadcast/send and broadcast/receive are globally-blocking. Broadcast/receives cannot be locally-blocking since they must post a receive, and receives cannot be locally-blocking. When a given process can leave a broadcast/receive operation is topology dependent, so, to avoid a hang as topology is varied, the broadcast/receive must be treated as if no process can leave until all processes have called the operation.

Broadcast/sends could be defined to be locally-blocking. Since no information is being received, as long as locally-blocking point to point sends are used, the broadcast/send will be locally-blocking. However, defining one process within a scope to be locally-blocking while all other processes are globally-blocking adds little to the programmability of the code. On the other hand, leaving the option open to have globally-blocking broadcast/sends may allow for optimization on some platforms.

The fact that broadcasts are defined as globally-blocking has several important implications. The first is that scoped operations (broadcasts or combines) must be strictly ordered, that is, all processes within a scope must agree on the order of calls to separate scoped operations. This constraint falls in line with that already in place for the computation of message IDs, and is present in point to point communication as well.

A less obvious result is that scoped operations with SCOPE = 'ALL' must be ordered with respect to any other scoped operation. This means that if there are two broadcasts to be done, one along a column, and one involving the entire process grid, all processes within the process column issuing the column broadcast must agree on which broadcast will be performed first.

The convention used in the communication routine names follows the template ?xxyy2d, where the letter in the ?
position indicates the data type being sent, xx is replaced to indicate the shape of the matrix, and the yy positions are used to indicate the type of communication to perform: i integer s single precision real d double precision real c single precision complex z double precision complex ge The data to be communicated is stored in a general rectangular matrix. tr The data to be communicated is stored in a trapezoidal matrix. bs Broadcast/send. A process begins the broadcast of data within a scope. br Broadcast/receive A process receives and participates in the broadcast of data within a scope. BLACS Broadcast Routines Routine name Operation performed gebs2d trbs2d Start a broadcast along a scope. gebr2d trbr2d Receive and participate in a broadcast along a scope. Optimization Notice Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for BLACS Routines 16 2559 Optimization Notice use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 ?gebs2d Starts a broadcast along a scope for a general rectangular matrix. Syntax call igebs2d( icontxt, scope, top, m, n, a, lda ) call sgebs2d( icontxt, scope, top, m, n, a, lda ) call dgebs2d( icontxt, scope, top, m, n, a, lda ) call cgebs2d( icontxt, scope, top, m, n, a, lda ) call zgebs2d( icontxt, scope, top, m, n, a, lda ) Input Parameters icontxt INTEGER. 
Integer handle that indicates the context.
scope CHARACTER*1. Indicates what scope the broadcast should proceed on. Limited to 'Row', 'Column', or 'All'.
top CHARACTER*1. Indicates the communication pattern to use for the broadcast.
m, n, a, lda Describe the matrix to be sent. See Matrix Shapes for details.

Description
This routine starts a broadcast along a scope. All other processes within the scope must call broadcast/receive for the broadcast to proceed. At the end of a broadcast, all processes within the scope will possess the data in the general rectangular matrix A.
Broadcasts may be globally-blocking. This means no process is guaranteed to return from a broadcast until all processes in the scope have called the appropriate routine (broadcast/send or broadcast/receive).

See Also
BLACS Routines Usage Example

?trbs2d
Starts a broadcast along a scope for a trapezoidal matrix.

Syntax
call itrbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )
call strbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )
call dtrbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )
call ctrbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )
call ztrbs2d( icontxt, scope, top, uplo, diag, m, n, a, lda )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
scope CHARACTER*1. Indicates what scope the broadcast should proceed on. Limited to 'Row', 'Column', or 'All'.
top CHARACTER*1. Indicates the communication pattern to use for the broadcast.
uplo, diag, m, n, a, lda Describe the matrix to be sent. See Matrix Shapes for details.

Description
This routine starts a broadcast along a scope. All other processes within the scope must call broadcast/receive for the broadcast to proceed. At the end of a broadcast, all processes within the scope will possess the data in the trapezoidal matrix A.
Broadcasts may be globally-blocking.
This means no process is guaranteed to return from a broadcast until all processes in the scope have called the appropriate routine (broadcast/send or broadcast/receive).

?gebr2d
Receives and participates in a broadcast along a scope for a general rectangular matrix.

Syntax
call igebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )
call sgebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )
call dgebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )
call cgebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )
call zgebr2d( icontxt, scope, top, m, n, a, lda, rsrc, csrc )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
scope CHARACTER*1. Indicates what scope the broadcast should proceed on. Limited to 'Row', 'Column', or 'All'.
top CHARACTER*1. Indicates the communication pattern to use for the broadcast.
m, n, lda Describe the matrix to be received. See Matrix Shapes for details.
rsrc INTEGER. The process row coordinate of the process that called broadcast/send.
csrc INTEGER. The process column coordinate of the process that called broadcast/send.

Output Parameters
a An array of dimension (lda,n) to receive the incoming message into.

Description
This routine receives and participates in a broadcast along a scope. At the end of a broadcast, all processes within the scope will possess the data in the general rectangular matrix A.
Broadcasts may be globally-blocking. This means no process is guaranteed to return from a broadcast until all processes in the scope have called the appropriate routine (broadcast/send or broadcast/receive).

See Also
BLACS Routines Usage Example

?trbr2d
Receives and participates in a broadcast along a scope for a trapezoidal matrix.
Syntax
call itrbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )
call strbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )
call dtrbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )
call ctrbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )
call ztrbr2d( icontxt, scope, top, uplo, diag, m, n, a, lda, rsrc, csrc )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
scope CHARACTER*1. Indicates what scope the broadcast should proceed on. Limited to 'Row', 'Column', or 'All'.
top CHARACTER*1. Indicates the communication pattern to use for the broadcast.
uplo, diag, m, n, lda Describe the matrix to be received. See Matrix Shapes for details.
rsrc INTEGER. The process row coordinate of the process that called broadcast/send.
csrc INTEGER. The process column coordinate of the process that called broadcast/send.

Output Parameters
a An array of dimension (lda,n) to receive the incoming message into.

Description
This routine receives and participates in a broadcast along a scope. At the end of a broadcast, all processes within the scope will possess the data in the trapezoidal matrix A.
Broadcasts may be globally-blocking. This means no process is guaranteed to return from a broadcast until all processes in the scope have called the appropriate routine (broadcast/send or broadcast/receive).

BLACS Support Routines
The support routines perform distinct tasks that can be used for:
• Initialization
• Destruction
• Information Purposes
• Miscellaneous Tasks.

Initialization Routines
This section describes BLACS routines that deal with grid/context creation, and processing before the grid/context has been defined.

BLACS Initialization Routines
Routine name      Operation performed
blacs_pinfo       Returns the number of processes available for use.
blacs_setup       Allocates virtual machine and spawns processes.
blacs_get         Gets values that BLACS use for internal defaults.
blacs_set         Sets values that BLACS use for internal defaults.
blacs_gridinit    Assigns available processes into BLACS process grid.
blacs_gridmap     Maps available processes into BLACS process grid.

blacs_pinfo
Returns the number of processes available for use.

Syntax
call blacs_pinfo( mypnum, nprocs )

Output Parameters
mypnum INTEGER. An integer between 0 and (nprocs - 1) that uniquely identifies each process.
nprocs INTEGER. The number of processes available for BLACS use.

Description
This routine is used when some initial system information is required before the BLACS are set up. On all platforms except PVM, nprocs is the actual number of processes available for use, that is, nprow * npcol <= nprocs. In PVM, the virtual machine may not have been set up before this call, and therefore no parallel machine exists. In this case, nprocs is returned as less than one. If a process has been spawned via the keyboard, it receives mypnum of 0, and all other processes get mypnum of -1. As a result, the user can distinguish between processes. On PVM, this routine returns the correct values for mypnum and nprocs only after the virtual machine has been set up via a call to blacs_setup.

See Also
BLACS Routines Usage Example

blacs_setup
Allocates virtual machine and spawns processes.

Syntax
call blacs_setup( mypnum, nprocs )

Input Parameters
nprocs INTEGER. On the process spawned from the keyboard rather than from pvmspawn, this parameter indicates the number of processes to create when building the virtual machine.

Output Parameters
mypnum INTEGER. An integer between 0 and (nprocs - 1) that uniquely identifies each process.
nprocs INTEGER. For all processes other than the one spawned from the keyboard, this parameter is the number of processes available for BLACS use.

Description
This routine only accomplishes meaningful work in the PVM BLACS. On all other platforms, it is functionally equivalent to blacs_pinfo.
The BLACS assume a static system, that is, the given number of processes does not change. PVM supplies a dynamic system, allowing processes to be added to the system on the fly.
blacs_setup is used to allocate the virtual machine and spawn off processes. It reads in a file called blacs_setup.dat, in which the first line must be the name of your executable. The second line is optional, but if it exists, it should be a PVM spawn flag. Legal values at this time are 0 (PvmTaskDefault), 4 (PvmTaskDebug), 8 (PvmTaskTrace), and 12 (PvmTaskDebug + PvmTaskTrace). The primary reason for this line is to allow the user to easily turn on and off PVM debugging. Additional lines, if any, specify what machines should be added to the current configuration before spawning nprocs-1 processes to the machines in a round robin fashion.
nprocs is input on the process which has no PVM parent (that is, mypnum=0), and both parameters are output for all processes. So, on PVM systems, the call to blacs_pinfo informs you that the virtual machine has not been set up, and a call to blacs_setup then sets up the machine and returns the real values for mypnum and nprocs.
Note that if the file blacs_setup.dat does not exist, the BLACS prompt the user for the executable name, and processes are spawned to the current PVM configuration.

See Also
BLACS Routines Usage Example

blacs_get
Gets values that BLACS use for internal defaults.

Syntax
call blacs_get( icontxt, what, val )

Input Parameters
icontxt INTEGER. On values of what that are tied to a particular context, this parameter is the integer handle indicating the context. Otherwise, ignored.
what INTEGER. Indicates what BLACS internal(s) should be returned in val.
Present options are:
• what = 0 : Handle indicating default system context
• what = 1 : The BLACS message ID range
• what = 2 : The BLACS debug level the library was compiled with
• what = 10 : Handle indicating the system context used to define the BLACS context whose handle is icontxt
• what = 11 : Number of rings multiring topology is presently using
• what = 12 : Number of branches general tree topology is presently using.

Output Parameters
val INTEGER. The value of the BLACS internal.

Description
This routine gets the values that the BLACS are using for internal defaults. Some values are tied to a BLACS context, and some are more general. The most common use is in retrieving a default system context for input into blacs_gridinit or blacs_gridmap.
Some systems, such as MPI*, supply their own version of context. For those users who mix system code with BLACS code, a BLACS context should be formed in reference to a system context. Thus, the grid creation routines take a system context as input. If you wish to have strictly portable code, you may use blacs_get to retrieve a default system context that will include all available processes. This value is not tied to a BLACS context, so the parameter icontxt is unused.
blacs_get returns information on three quantities that are tied to an individual BLACS context, which is passed in as icontxt. The information that may be retrieved is:
• The handle of the system context upon which this BLACS context was defined
• The number of rings for TOP = 'M' (multiring broadcast)
• The number of branches for TOP = 'T' (general tree broadcast/general tree gather).

See Also
BLACS Routines Usage Example

blacs_set
Sets values that BLACS use for internal defaults.

Syntax
call blacs_set( icontxt, what, val )

Input Parameters
icontxt INTEGER. For values of what that are tied to a particular context, this parameter is the integer handle indicating the context. Otherwise, ignored.
what INTEGER. Indicates what BLACS internal(s) should be set. Present values are:
• 1 = The BLACS message ID range
• 11 = Number of rings for multiring topology to use
• 12 = Number of branches for general tree topology to use.
val INTEGER. Array of dimension (*). Indicates the value(s) the internals should be set to. The specific meanings depend on what values, as discussed in Description below.

Description
This routine sets the BLACS internal defaults depending on what values:

what = 1
Setting the BLACS message ID range. If you wish to mix the BLACS with other message-passing packages, restrict the BLACS to a certain message ID range not to be used by the non-BLACS routines. The message ID range must be set before the first call to blacs_gridinit or blacs_gridmap. Subsequent calls will have no effect. Because the message ID range is not tied to a particular context, the parameter icontxt is ignored, and val is defined as:
VAL (input) INTEGER array of dimension (2)
VAL(1) : The smallest message ID (also called message type or message tag) the BLACS should use.
VAL(2) : The largest message ID (also called message type or message tag) the BLACS should use.

what = 11
Set number of rings for TOP = 'M' (multiring broadcast). This quantity is tied to a context, so icontxt is used, and val is defined as:
VAL (input) INTEGER array of dimension (1)
VAL(1) : The number of rings for multiring topology to use.

what = 12
Set number of branches for TOP = 'T' (general tree broadcast/general tree gather). This quantity is tied to a context, so icontxt is used, and val is defined as:
VAL (input) INTEGER array of dimension (1)
VAL(1) : The number of branches for general tree topology to use.

blacs_gridinit
Assigns available processes into BLACS process grid.

Syntax
call blacs_gridinit( icontxt, order, nprow, npcol )

Input Parameters
icontxt INTEGER. Integer handle indicating the system context to be used in creating the BLACS context.
Call blacs_get to obtain a default system context.
order CHARACTER*1. Indicates how to map processes to BLACS grid. Options are:
• 'R' : Use row-major natural ordering
• 'C' : Use column-major natural ordering
• ELSE : Use row-major natural ordering
nprow INTEGER. Indicates how many process rows the process grid should contain.
npcol INTEGER. Indicates how many process columns the process grid should contain.

Output Parameters
icontxt INTEGER. Integer handle to the created BLACS context.

Description
All BLACS codes must call this routine, or its sister routine blacs_gridmap. These routines take the available processes, and assign, or map, them into a BLACS process grid. In other words, they establish how the BLACS coordinate system maps into the native machine process numbering system. Each BLACS grid is contained in a context, so that it does not interfere with distributed operations that occur within other grids/contexts. These grid creation routines may be called repeatedly to define additional contexts/grids.
The creation of a grid requires input from all processes that are defined to be in this grid. Processes belonging to more than one grid have to agree on which grid formation will be serviced first, much like the globally blocking sum or broadcast.
These grid creation routines set up various internals for the BLACS, and one of them must be called before any calls are made to the non-initialization BLACS.
Note that these routines map already existing processes to a grid: the processes are not created dynamically. On most parallel machines, the processes are actual processors (hardware), and they are "created" when you run your executable. When using the PVM BLACS, if the virtual machine has not been set up yet, the routine blacs_setup should be used to create the virtual machine.
This routine creates a simple nprow x npcol process grid.
This process grid uses the first nprow * npcol processes, and assigns them to the grid in a row- or column-major natural ordering. If these process-to-grid mappings are unacceptable, call blacs_gridmap.

See Also
BLACS Routines Usage Example
blacs_get
blacs_gridmap
blacs_setup

blacs_gridmap
Maps available processes into BLACS process grid.

Syntax
call blacs_gridmap( icontxt, usermap, ldumap, nprow, npcol )

Input Parameters
icontxt INTEGER. Integer handle indicating the system context to be used in creating the BLACS context. Call blacs_get to obtain a default system context.
usermap INTEGER. Array, dimension (ldumap, npcol), indicating the process-to-grid mapping.
ldumap INTEGER. Leading dimension of the 2D array usermap. ldumap >= nprow.
nprow INTEGER. Indicates how many process rows the process grid should contain.
npcol INTEGER. Indicates how many process columns the process grid should contain.

Output Parameters
icontxt INTEGER. Integer handle to the created BLACS context.

Description
All BLACS codes must call this routine, or its sister routine blacs_gridinit. These routines take the available processes, and assign, or map, them into a BLACS process grid. In other words, they establish how the BLACS coordinate system maps into the native machine process numbering system. Each BLACS grid is contained in a context, so that it does not interfere with distributed operations that occur within other grids/contexts. These grid creation routines may be called repeatedly to define additional contexts/grids.
The creation of a grid requires input from all processes that are defined to be in this grid. Processes belonging to more than one grid have to agree on which grid formation will be serviced first, much like the globally blocking sum or broadcast.
These grid creation routines set up various internals for the BLACS, and one of them must be called before any calls are made to the non-initialization BLACS.
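As an illustration of the blacs_gridmap calling sequence, the following minimal sketch builds a 2 x 2 grid from an explicit usermap array. It is a sketch under stated assumptions, not a definitive implementation: it assumes a non-PVM platform on which system process numbers are simply 0, 1, 2, and so on, and the program name, the fixed grid size, and the column-major numbering are illustrative choices. The test on myprow and mypcol follows the blacs_gridinfo convention that all quantities are returned as -1 when the handle does not refer to a valid context.

```fortran
program gridmap_sketch
  implicit none
  integer, parameter :: nprow = 2, npcol = 2
  integer :: usermap(nprow, npcol)
  integer :: icontxt, iam, nprocs, i, j
  integer :: nr, nc, myprow, mypcol

  call blacs_pinfo(iam, nprocs)
  if (nprocs < nprow*npcol) then
     if (iam == 0) write(*,*) 'Need at least ', nprow*npcol, ' processes'
     call blacs_exit(0)
     stop
  end if

  ! Assumption: system process numbers are 0 .. nprocs-1 (not PVM TIDs).
  ! usermap(i,j) holds the process placed at grid position {i-1, j-1};
  ! here the processes are numbered down the columns.
  do j = 1, npcol
     do i = 1, nprow
        usermap(i, j) = (j - 1)*nprow + (i - 1)
     end do
  end do

  ! On input icontxt is the system context; on output it is the
  ! handle of the newly created BLACS context.
  call blacs_get(0, 0, icontxt)
  call blacs_gridmap(icontxt, usermap, nprow, nprow, npcol)

  call blacs_gridinfo(icontxt, nr, nc, myprow, mypcol)
  if (myprow >= 0 .and. mypcol >= 0) then
     write(*,*) 'Process ', iam, ' is at {', myprow, ',', mypcol, '}'
     call blacs_gridexit(icontxt)
  end if

  call blacs_exit(0)
end program gridmap_sketch
```

Here ldumap is passed as nprow because usermap is declared with exactly nprow rows; a larger leading dimension would be needed only if the map were stored inside a bigger array.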
Note that these routines map already existing processes to a grid: the processes are not created dynamically. On most parallel machines, the processes are actual processors (hardware), and they are "created" when you run your executable. When using the PVM BLACS, if the virtual machine has not been set up yet, the routine blacs_setup should be used to create the virtual machine.
This routine allows the user to map processes to the process grid in an arbitrary manner. usermap(i,j) holds the process number of the process to be placed in {i, j} of the process grid. On most distributed systems, this process number is a machine defined number between 0 ... nprocs-1. For PVM, these node numbers are the PVM TIDS (Task IDs). The blacs_gridmap routine is intended for an experienced user. The blacs_gridinit routine is much simpler. blacs_gridinit simply performs a gridmap where the first nprow * npcol processes are mapped into the current grid in a row-major natural ordering. If you are an experienced user, blacs_gridmap allows you to take advantage of your system's actual layout. That is, you can map nodes that are physically connected to be neighbors in the BLACS grid, etc. The blacs_gridmap routine also opens the way for multigridding: you can separate your nodes into arbitrary grids, join them together at some later date, and then re-split them into new grids. blacs_gridmap also provides the ability to make arbitrary grids or subgrids (for example, a "nearest neighbor" grid), which can greatly facilitate operations among processes that do not fall on a row or column of the main process grid.

See Also
BLACS Routines Usage Example
blacs_get
blacs_gridinit
blacs_setup

Destruction Routines
This section describes BLACS routines that destroy grids, abort processes, and free resources.

BLACS Destruction Routines
Routine name      Operation performed
blacs_freebuff    Frees BLACS buffer.
blacs_gridexit    Frees a BLACS context.
blacs_abort       Aborts all processes.
blacs_exit        Frees all BLACS contexts and releases all allocated memory.

blacs_freebuff
Frees BLACS buffer.

Syntax
call blacs_freebuff( icontxt, wait )

Input Parameters
icontxt INTEGER. Integer handle that indicates the BLACS context.
wait INTEGER. Parameter indicating whether to wait for non-blocking operations or not. If wait equals 0, the operations are not waited for and only unused buffers are freed. Otherwise, the routine waits in order to free all buffers.

Description
This routine releases the BLACS buffer.
The BLACS have at least one internal buffer that is used for packing messages. The number of internal buffers depends on what platform you are running the BLACS on. On systems where memory is tight, keeping this buffer or buffers may become expensive. Call blacs_freebuff to release the buffer. However, the next call of a communication routine that requires packing reallocates the buffer.
The wait parameter determines whether the BLACS should wait for any non-blocking operations to be completed or not. If wait = 0, the BLACS free any buffers that can be freed without waiting. If wait is not 0, the BLACS free all internal buffers, even if non-blocking operations must be completed first.

blacs_gridexit
Frees a BLACS context.

Syntax
call blacs_gridexit( icontxt )

Input Parameters
icontxt INTEGER. Integer handle that indicates the BLACS context to be freed.

Description
This routine frees a BLACS context.
Release the resources when contexts are no longer needed. After freeing a context, the context no longer exists, and its handle may be re-used if new contexts are defined.

blacs_abort
Aborts all processes.

Syntax
call blacs_abort( icontxt, errornum )

Input Parameters
icontxt INTEGER. Integer handle that indicates the BLACS context to be aborted.
errornum INTEGER. User-defined integer error number.

Description
This routine aborts all the BLACS processes, not only those confined to a particular context.
Use blacs_abort to abort all the processes in case of a serious error. Note that both parameters are input, but the routine uses them only in printing out the error message. The context handle passed in is not required to be a valid context handle.

blacs_exit
Frees all BLACS contexts and releases all allocated memory.

Syntax
call blacs_exit( continue )

Input Parameters
continue INTEGER. Flag indicating whether message passing continues after the BLACS are done. If continue is non-zero, the user is assumed to continue using the machine after completing the BLACS. Otherwise, no message passing is assumed after calling this routine.

Description
This routine frees all BLACS contexts and releases all allocated memory.
This routine should be called when a process has finished all use of the BLACS. The continue parameter indicates whether the user will be using the underlying communication platform after the BLACS are finished. This information is most important for the PVM BLACS. If continue is set to 0, then pvm_exit is called; otherwise, it is not called. Setting continue not equal to 0 indicates that explicit PVM send/recvs will be called after the BLACS routines are used. In that case, make sure your code calls pvm_exit itself. PVM users should either call blacs_exit or explicitly call pvm_exit to avoid PVM problems.

See Also
BLACS Routines Usage Example

Informational Routines
This section describes BLACS routines that return information involving the process grid.

BLACS Informational Routines
Routine name      Operation performed
blacs_gridinfo    Returns information on the current grid.
blacs_pnum        Returns the system process number of the process in the process grid.
blacs_pcoord      Returns the row and column coordinates in the process grid.

blacs_gridinfo
Returns information on the current grid.

Syntax
call blacs_gridinfo( icontxt, nprow, npcol, myprow, mypcol )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.

Output Parameters
nprow INTEGER.
Number of process rows in the current process grid.
npcol INTEGER. Number of process columns in the current process grid.
myprow INTEGER. Row coordinate of the calling process in the process grid.
mypcol INTEGER. Column coordinate of the calling process in the process grid.

Description
This routine returns information on the current grid. If the context handle does not point at a valid context, all quantities are returned as -1.

See Also
BLACS Routines Usage Example

blacs_pnum
Returns the system process number of the process in the process grid.

Syntax
result = blacs_pnum( icontxt, prow, pcol )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
prow INTEGER. Row coordinate of the process whose system process number is to be determined.
pcol INTEGER. Column coordinate of the process whose system process number is to be determined.

Description
This function returns the system process number of the process at {PROW, PCOL} in the process grid.

See Also
BLACS Routines Usage Example

blacs_pcoord
Returns the row and column coordinates in the process grid.

Syntax
call blacs_pcoord( icontxt, pnum, prow, pcol )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
pnum INTEGER. Process number whose coordinates are to be determined. This parameter stands for the process number of the underlying machine, that is, it is a tid for PVM.

Output Parameters
prow INTEGER. Row coordinate of the pnum process in the BLACS grid.
pcol INTEGER. Column coordinate of the pnum process in the BLACS grid.

Description
Given the system process number, this routine returns the row and column coordinates in the BLACS process grid.

See Also
BLACS Routines Usage Example

Miscellaneous Routines
This section describes the blacs_barrier routine.
BLACS Miscellaneous Routines
Routine name      Operation performed
blacs_barrier     Holds up execution of all processes within the indicated scope until they have all called the routine.

blacs_barrier
Holds up execution of all processes within the indicated scope.

Syntax
call blacs_barrier( icontxt, scope )

Input Parameters
icontxt INTEGER. Integer handle that indicates the context.
scope CHARACTER*1. Parameter that indicates whether a process row (scope='R'), column ('C'), or entire grid ('A') will participate in the barrier.

Description
This routine holds up execution of all processes within the indicated scope until they have all called the routine.

Examples of BLACS Routines Usage

Example. BLACS Usage. Hello World
The following routine takes the available processes, forms them into a process grid, and then has each process check in with the process at {0,0} in the process grid.

      PROGRAM HELLO
*     -- BLACS example code --
*     Written by Clint Whaley 7/26/94
*     Performs a simple check-in type hello world
*     ..
*     .. External Functions ..
      INTEGER BLACS_PNUM
      EXTERNAL BLACS_PNUM
*     ..
*     .. Variable Declaration ..
      INTEGER CONTXT, IAM, NPROCS, NPROW, NPCOL, MYPROW, MYPCOL
      INTEGER ICALLER, I, J, HISROW, HISCOL
*
*     Determine my process number and the number of processes in
*     machine
*
      CALL BLACS_PINFO(IAM, NPROCS)
*
*     If in PVM, create virtual machine if it doesn't exist
*
      IF (NPROCS .LT. 1) THEN
         IF (IAM .EQ. 0) THEN
            WRITE(*, 1000)
            READ(*, 2000) NPROCS
         END IF
         CALL BLACS_SETUP(IAM, NPROCS)
      END IF
*
*     Set up process grid that is as close to square as possible
*
      NPROW = INT( SQRT( REAL(NPROCS) ) )
      NPCOL = NPROCS / NPROW
*
*     Get default system context, and define grid
*
      CALL BLACS_GET(0, 0, CONTXT)
      CALL BLACS_GRIDINIT(CONTXT, 'Row', NPROW, NPCOL)
      CALL BLACS_GRIDINFO(CONTXT, NPROW, NPCOL, MYPROW, MYPCOL)
*
*     If I'm not in grid, go to end of program
*
      IF ( (MYPROW.GE.NPROW) .OR. (MYPCOL.GE.NPCOL) ) GOTO 30
*
*     Get my process ID from my grid coordinates
*
      ICALLER = BLACS_PNUM(CONTXT, MYPROW, MYPCOL)
*
*     If I am process {0,0}, receive check-in messages from
*     all nodes
*
      IF ( (MYPROW.EQ.0) .AND. (MYPCOL.EQ.0) ) THEN
         WRITE(*,*) ' '
         DO 20 I = 0, NPROW-1
            DO 10 J = 0, NPCOL-1
               IF ( (I.NE.0) .OR. (J.NE.0) ) THEN
                  CALL IGERV2D(CONTXT, 1, 1, ICALLER, 1, I, J)
               END IF
*
*              Make sure ICALLER is where we think in process grid
*
               CALL BLACS_PCOORD(CONTXT, ICALLER, HISROW, HISCOL)
               IF ( (HISROW.NE.I) .OR. (HISCOL.NE.J) ) THEN
                  WRITE(*,*) 'Grid error! Halting . . .'
                  STOP
               END IF
               WRITE(*, 3000) I, J, ICALLER
   10       CONTINUE
   20    CONTINUE
         WRITE(*,*) ' '
         WRITE(*,*) 'All processes checked in. Run finished.'
*
*     All processes but {0,0} send process ID as a check-in
*
      ELSE
         CALL IGESD2D(CONTXT, 1, 1, ICALLER, 1, 0, 0)
      END IF
   30 CONTINUE
      CALL BLACS_EXIT(0)
 1000 FORMAT('How many processes in machine?')
 2000 FORMAT(I6)
 3000 FORMAT('Process {',I2,',',I2,'} (node number =',I6,
     $       ') has checked in.')
      STOP
      END

Example. BLACS Usage. PROCMAP
This routine maps processes to a grid using blacs_gridmap.

      SUBROUTINE PROCMAP(CONTEXT, MAPPING, BEGPROC, NPROW, NPCOL, IMAP)
*
*     -- BLACS example code --
*     Written by Clint Whaley 7/26/94
*     ..
*     .. Scalar Arguments ..
      INTEGER CONTEXT, MAPPING, BEGPROC, NPROW, NPCOL
*     ..
*     .. Array Arguments ..
      INTEGER IMAP(NPROW, *)
*     ..
*
*  Purpose
*  =======
*  PROCMAP maps NPROW*NPCOL processes starting from process BEGPROC to
*  the grid in a variety of ways depending on the parameter MAPPING.
*
*  Arguments
*  =========
*
*  CONTEXT  (output) INTEGER
*           This integer is used by the BLACS to indicate a context.
*           A context is a universe where messages exist and do not
*           interact with other context's messages. The context
*           includes the definition of a grid, and each process's
*           coordinates in it.
*
*  MAPPING  (input) INTEGER
*           Way to map processes to grid. Choices are:
*           1 : row-major natural ordering
*           2 : column-major natural ordering
*
*  BEGPROC  (input) INTEGER
*           The process number (between 0 and NPROCS-1) to use as
*           {0,0}. From this process, processes will be assigned
*           to the grid as indicated by MAPPING.
*
*  NPROW    (input) INTEGER
*           The number of process rows the created grid
*           should have.
*
*  NPCOL    (input) INTEGER
*           The number of process columns the created grid
*           should have.
*
*  IMAP     (workspace) INTEGER array of dimension (NPROW, NPCOL)
*           Workspace, where the array which maps the
*           processes to the grid will be stored for the
*           call to GRIDMAP.
*
*  ===============================================================
*
*     ..
*     .. External Functions ..
      INTEGER BLACS_PNUM
      EXTERNAL BLACS_PNUM
*     ..
*     .. External Subroutines ..
      EXTERNAL BLACS_PINFO, BLACS_GRIDINIT, BLACS_GRIDMAP
*     ..
*     .. Local Scalars ..
      INTEGER TMPCONTXT, NPROCS, I, J, K
*     ..
*     .. Executable Statements ..
*
*     See how many processes there are in the system
*
      CALL BLACS_PINFO( I, NPROCS )
      IF (NPROCS-BEGPROC .LT. NPROW*NPCOL) THEN
         WRITE(*,*) 'Not enough processes for grid'
         STOP
      END IF
*
*     Temporarily map all processes into 1 x NPROCS grid
*
      CALL BLACS_GET( 0, 0, TMPCONTXT )
      CALL BLACS_GRIDINIT( TMPCONTXT, 'Row', 1, NPROCS )
      K = BEGPROC
*
*     If we want a row-major natural ordering
*
      IF (MAPPING .EQ. 1) THEN
         DO I = 1, NPROW
            DO J = 1, NPCOL
               IMAP(I, J) = BLACS_PNUM(TMPCONTXT, 0, K)
               K = K + 1
            END DO
         END DO
*
*     If we want a column-major natural ordering
*
      ELSE IF (MAPPING .EQ. 2) THEN
         DO J = 1, NPCOL
            DO I = 1, NPROW
               IMAP(I, J) = BLACS_PNUM(TMPCONTXT, 0, K)
               K = K + 1
            END DO
         END DO
      ELSE
         WRITE(*,*) 'Unknown mapping.'
         STOP
      END IF
*
*     Free temporary context
*
      CALL BLACS_GRIDEXIT(TMPCONTXT)
*
*     Apply the new mapping to form desired context
*
      CALL BLACS_GET( 0, 0, CONTEXT )
      CALL BLACS_GRIDMAP( CONTEXT, IMAP, NPROW, NPROW, NPCOL )
      RETURN
      END

Example. BLACS Usage.
PARALLEL DOT PRODUCT

This routine does a bone-headed parallel double precision dot product of two vectors. Arguments are input on process {0,0}, and output everywhere else.

      DOUBLE PRECISION FUNCTION PDDOT( CONTEXT, N, X, Y )
*
*  -- BLACS example code --
*  Written by Clint Whaley 7/26/94
*     ..
*     .. Scalar Arguments ..
      INTEGER CONTEXT, N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION X(*), Y(*)
*     ..
*
*  Purpose
*  =======
*  PDDOT is a restricted parallel version of the BLAS routine
*  DDOT. It assumes that the increment on both vectors is one,
*  and that process {0,0} starts out owning the vectors and
*  has N. It returns the dot product of the two N-length vectors
*  X and Y, that is, PDDOT = X' Y.
*
*  Arguments
*  =========
*
*  CONTEXT      (input) INTEGER
*               This integer is used by the BLACS to indicate a context.
*               A context is a universe where messages exist and do not
*               interact with other context's messages. The context
*               includes the definition of a grid, and each process's
*               coordinates in it.
*
*  N            (input/output) INTEGER
*               The length of the vectors X and Y. Input
*               for {0,0}, output for everyone else.
*
*  X            (input/output) DOUBLE PRECISION array of dimension (N)
*               The vector X of PDDOT = X' Y. Input for {0,0},
*               output for everyone else.
*
*  Y            (input/output) DOUBLE PRECISION array of dimension (N)
*               The vector Y of PDDOT = X' Y. Input for {0,0},
*               output for everyone else.
*
*  ===============================================================
*
*     ..
*     .. External Functions ..
      DOUBLE PRECISION DDOT
      EXTERNAL DDOT
*     ..
*     .. External Subroutines ..
      EXTERNAL BLACS_GRIDINFO, DGEBS2D, DGEBR2D, DGSUM2D
*     ..
*     .. Local Scalars ..
      INTEGER IAM, NPROCS, NPROW, NPCOL, MYPROW, MYPCOL, I, LN
      DOUBLE PRECISION LDDOT
*     ..
*     .. Executable Statements ..
*
*     Find out what grid has been set up, and pretend it is 1-D
*
      CALL BLACS_GRIDINFO( CONTEXT, NPROW, NPCOL, MYPROW, MYPCOL )
      IAM = MYPROW*NPCOL + MYPCOL
      NPROCS = NPROW * NPCOL
*
*     Do bone-headed thing, and just send entire X and Y to
*     everyone
*
      IF ( (MYPROW.EQ.0) .AND. (MYPCOL.EQ.0) ) THEN
         CALL IGEBS2D(CONTEXT, 'All', 'i-ring', 1, 1, N, 1 )
         CALL DGEBS2D(CONTEXT, 'All', 'i-ring', N, 1, X, N )
         CALL DGEBS2D(CONTEXT, 'All', 'i-ring', N, 1, Y, N )
      ELSE
         CALL IGEBR2D(CONTEXT, 'All', 'i-ring', 1, 1, N, 1, 0, 0 )
         CALL DGEBR2D(CONTEXT, 'All', 'i-ring', N, 1, X, N, 0, 0 )
         CALL DGEBR2D(CONTEXT, 'All', 'i-ring', N, 1, Y, N, 0, 0 )
      ENDIF
*
*     Find out the number of local rows to multiply (LN), and
*     where in vectors to start (I)
*
      LN = N / NPROCS
      I = 1 + IAM * LN
*
*     Last process does any extra rows
*
      IF (IAM .EQ. NPROCS-1) LN = LN + MOD(N, NPROCS)
*
*     Figure dot product of my piece of X and Y
*
      LDDOT = DDOT( LN, X(I), 1, Y(I), 1 )
*
*     Add local dot products to get global dot product;
*     give all procs the answer
*
      CALL DGSUM2D( CONTEXT, 'All', '1-tree', 1, 1, LDDOT, 1, -1, 0 )
      PDDOT = LDDOT
      RETURN
      END

Example. BLACS Usage. PARALLEL MATRIX INFINITY NORM

This routine does a parallel infinity norm on a distributed double precision matrix. Unlike the PDDOT example, this routine assumes the matrix has already been distributed.

      DOUBLE PRECISION FUNCTION PDINFNRM(CONTXT, LM, LN, A, LDA, WORK)
*
*  -- BLACS example code --
*  Written by Clint Whaley.
*     ..
*     .. Scalar Arguments ..
      INTEGER CONTXT, LM, LN, LDA
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION A(LDA, *), WORK(*)
*     ..
*
*  Purpose
*  =======
*  Compute the infinity norm of a distributed matrix, where
*  the matrix is spread across a 2D process grid. The result is
*  left on all processes.
*
*  Arguments
*  =========
*
*  CONTXT       (input) INTEGER
*               This integer is used by the BLACS to indicate a context.
*               A context is a universe where messages exist and do not
*               interact with other context's messages. The context
*               includes the definition of a grid, and each process's
*               coordinates in it.
*
*  LM           (input) INTEGER
*               Number of rows of the global matrix owned by this
*               process.
*
*  LN           (input) INTEGER
*               Number of columns of the global matrix owned by this
*               process.
*
*  A            (input) DOUBLE PRECISION, dimension (LDA,N)
*               The matrix whose norm you wish to compute.
*
*  LDA          (input) INTEGER
*               Leading Dimension of A.
*
*  WORK         (temporary) DOUBLE PRECISION array, dimension (LM)
*               Temporary work space used for summing rows.
*
*     .. External Subroutines ..
      EXTERNAL BLACS_GRIDINFO, DGEBS2D, DGEBR2D, DGSUM2D, DGAMX2D
*     ..
*     .. External Functions ..
      INTEGER IDAMAX
      DOUBLE PRECISION DASUM
*
*     .. Local Scalars ..
      INTEGER NPROW, NPCOL, MYROW, MYCOL, I, J
      DOUBLE PRECISION MAX
*
*     .. Executable Statements ..
*
*     Get process grid information
*
      CALL BLACS_GRIDINFO( CONTXT, NPROW, NPCOL, MYROW, MYCOL )
*
*     Add all local rows together
*
      DO 20 I = 1, LM
         WORK(I) = DASUM(LN, A(I,1), LDA)
 20   CONTINUE
*
*     Find sum of global matrix rows and store on column 0 of
*     process grid
*
      CALL DGSUM2D(CONTXT, 'Row', '1-tree', LM, 1, WORK, LM, MYROW, 0)
*
*     Find maximum sum of rows for supnorm
*
      IF (MYCOL .EQ. 0) THEN
         MAX = WORK(IDAMAX(LM,WORK,1))
         IF (LM .LT. 1) MAX = 0.0D0
         CALL DGAMX2D(CONTXT, 'Col', 'h', 1, 1, MAX, 1, I, I, -1, -1, 0)
      END IF
*
*     Process column 0 has answer; send answer to all nodes
*
      IF (MYCOL .EQ. 0) THEN
         CALL DGEBS2D(CONTXT, 'Row', ' ', 1, 1, MAX, 1)
      ELSE
         CALL DGEBR2D(CONTXT, 'Row', ' ', 1, 1, MAX, 1, 0, 0)
      END IF
*
      PDINFNRM = MAX
*
      RETURN
*
*     End of PDINFNRM
*
      END

Data Fitting Functions

Data Fitting functions in Intel® MKL provide spline-based interpolation capabilities that you can use to approximate functions, function derivatives or integrals, and perform cell search operations.

The Data Fitting component is task based. The task is a data structure or descriptor that holds the parameters related to a specific Data Fitting operation. You can modify the task parameters using the task editing functionality of the library. For definition of the implemented operations, see Mathematical Conventions.

Data Fitting routines use the following workflow to process a task:
1. Create a task or multiple tasks.
2. Modify the task parameters.
3. Perform a Data Fitting computation.
4. Destroy the task or tasks.

All Data Fitting functions fall into the following categories:

Task Creation and Initialization Routines - routines that create a new Data Fitting task descriptor and initialize the most common parameters, such as partition of the interpolation interval, values of the vector-valued function, and the parameters describing their structure.

Task Editors - routines that set or modify parameters in an existing Data Fitting task.

Computational Routines - routines that perform Data Fitting computations, such as construction of a spline, interpolation, computation of derivatives and integrals, and search.

Task Destructors - routines that delete Data Fitting task descriptors and deallocate resources.

You can access the Data Fitting routines through the Fortran and C89/C99 language interfaces.
You can also use the C89 interface with more recent versions of C/C++, or the Fortran 90 interface with programs written in Fortran 95.

The ${MKL}/include directory of the Intel® MKL contains the following Data Fitting header files:
• C/C++: mkl_df.h
• Fortran: mkl_df.f90 and mkl_df.f77

You can find examples that demonstrate C/C++ and Fortran usage of Data Fitting routines in the ${MKL}/examples/datafittingc and ${MKL}/examples/datafittingf directories, respectively.

Naming Conventions

In the Fortran interface, the names of the Data Fitting functions are in lowercase, while the names of the types and constants are in uppercase. In the C/C++ interface, the names of the Data Fitting functions, types, and constants are case-sensitive and can be in lowercase, uppercase, or mixed case.

The names of all routines have the following structure:

df[datatype]<name>

where
• df is a prefix indicating that the routine belongs to the Data Fitting component of Intel MKL.
• [datatype] field specifies the type of the input and/or output data and can be s (for the single precision real type), d (for the double precision real type), or i (for the integer type). This field is omitted in the names of the routines that are not data type dependent.
• <name> field specifies the functionality the routine performs. For example, this field can be NewTask1D, Interpolate1D, or DeleteTask.

Data Types

The Data Fitting component provides routines for processing single and double precision real data types. The results of cell search operations are returned as a generic integer data type.

All Data Fitting routines use the following data type:

Type                          Data Object
Fortran: TYPE(DF_TASK)        Pointer to a task
C: DFTaskPtr

NOTE The actual size of the generic integer type is platform-dependent. Before compiling your application, you need to set an appropriate byte size for integers. For details, see section Using the ILP64 Interface vs. LP64 Interface of the Intel® MKL User's Guide.
Mathematical Conventions

This section explains the notation used for Data Fitting function descriptions. Spline notations are based on the terminology and definitions of [deBoor2001]. The definition of Subbotin quadratic splines follows the conventions of [StechSub76].

Mathematical Notation in the Data Fitting Component

Concept: Partition of interpolation interval [a, b], where
• $x_i$ denotes breakpoints.
• $[x_i, x_{i+1})$ denotes a sub-interval (cell) of size $x_{i+1} - x_i$.
Mathematical Notation: $\{x_i\}_{i=1,\dots,n}$, where $a = x_1 < x_2 < \dots$

... b), the df?integrateex1d routine passes max(llim, b) as the left integration limit and rlim as the right integration limit to the user-defined callback function.
• If the left and the right integration limits belong to the interpolation interval, the df?integrateex1d routine passes them to the user-defined callback function unchanged.

The value of the integral is the sum of integral values obtained on the sub-intervals.

See Also
df?integrate1d/df?integrateex1d
df?integrcallback
df?searchcellscallback

df?searchcellscallback

A callback function for user-defined search to be passed into df?interpolateex1d or df?searchcellsex1d.

Syntax

Fortran:
status = dfssearchcellscallback(n, site, cell, flag, params)
status = dfdsearchcellscallback(n, site, cell, flag, params)

C:
status = dfsSearchCellsCallBack(n, site, cell, flag, params)
status = dfdSearchCellsCallBack(n, site, cell, flag, params)

Include Files
• Fortran: mkl_df.f90 and mkl_df.f77
• C: mkl_df.h

Input Parameters

n
    Type: Fortran: INTEGER(KIND=8)
          C: long long*
    Description: Number of interpolation sites.

site
    Type: Fortran: REAL(KIND=4) DIMENSION(*) for dfssearchcellscallback
                   REAL(KIND=8) DIMENSION(*) for dfdsearchcellscallback
          C: float* for dfsSearchCellsCallBack
             double* for dfdSearchCellsCallBack
    Description: Array of interpolation sites of size n.

cell
    Type: Fortran: INTEGER(KIND=8) DIMENSION(*)
          C: long long*
    Description: Array of size n that returns indices of the cells computed by the callback function.

flag
    Type: Fortran: INTEGER(KIND=4) DIMENSION(*)
          C: int*
    Description: Array of size n, with values set as follows:
    • If the cell with index cell[i] contains site[i], set flag[i] to 1.
    • Otherwise, set flag[i] to zero. In this case, the library interprets the index as an approximation and computes the index of the cell containing site[i] by using the provided index as a starting point for the search.

params
    Type: Fortran: INTEGER DIMENSION(*)
          C: void*
    Description: Pointer to user-defined parameters of the callback function.

Output Parameters

status
    Type: Fortran: INTEGER
          C: int
    Description: The status returned by the callback function:
    • Zero indicates successful completion of the callback operation.
    • A negative value indicates an error.
    • The DF_STATUS_EXACT_RESULT status indicates that cell indices returned by the callback function are exact. In this case, you do not need to initialize entries of the flag array.
    • A positive value indicates a warning.
    See "Task Status and Error Reporting" for error code definitions.

Description

When passed into the df?interpolateex1d or df?searchcellsex1d routine, this function performs a user-defined search.

See Also
df?interpolate1d/df?interpolateex1d
df?interpcallback

Task Destructors

Task destructors are routines used to delete task descriptors and deallocate the corresponding memory resources. The Data Fitting task destructor dfdeletetask destroys a Data Fitting task and frees the memory.

dfdeletetask

Destroys a Data Fitting task object and frees the memory.
Syntax

Fortran:
status = dfdeletetask(task)

C:
status = dfDeleteTask(&task)

Include Files
• Fortran: mkl_df.f90 and mkl_df.f77
• C: mkl_df.h

Input Parameters

task
    Type: Fortran: TYPE(DF_TASK)
          C: DFTaskPtr
    Description: Descriptor of the task to destroy.

Output Parameters

status
    Type: Fortran: INTEGER
          C: int
    Description: Status of the routine:
    • DF_STATUS_OK if the task is deleted successfully.
    • Non-zero error code if the operation failed. See "Task Status and Error Reporting" for error code definitions.

Description

Given a pointer to a task descriptor, this routine deletes the Data Fitting task descriptor and frees the memory allocated for the structure. If the task is deleted successfully, the routine sets the task pointer to NULL. Otherwise, the routine returns an error code.

Linear Solvers Basics

Many applications in science and engineering require the solution of a system of linear equations. This problem is usually expressed mathematically by the matrix-vector equation, Ax = b, where A is an m-by-n matrix, x is the n element column vector and b is the m element column vector. The matrix A is usually referred to as the coefficient matrix, and the vectors x and b are referred to as the solution vector and the right-hand side, respectively.

Basic concepts related to solving linear systems with sparse matrices are described in section Sparse Linear Systems that follows.

Sparse Linear Systems

In many real-life applications, most of the elements in A are zero. Such a matrix is referred to as sparse. Conversely, matrices with very few zero elements are called dense. For sparse matrices, computing the solution to the equation Ax = b can be made much more efficient with respect to both storage and computation time, if the sparsity of the matrix can be exploited. The more an algorithm can exploit the sparsity without sacrificing the correctness, the better the algorithm.
Generally speaking, computer software that finds solutions to systems of linear equations is called a solver. A solver designed to work specifically on sparse systems of equations is called a sparse solver. Solvers are usually classified into two groups: direct and iterative.

Iterative Solvers start with an initial approximation to a solution and attempt to estimate the difference between the approximation and the true result. Based on the difference, an iterative solver calculates a new approximation that is closer to the true result than the initial approximation. This process is repeated until the difference between the approximation and the true result is sufficiently small. The main drawback to iterative solvers is that the rate of convergence depends greatly on the values in the matrix A. Consequently, it is not possible to predict how long it will take for an iterative solver to produce a solution. In fact, for ill-conditioned matrices, the iterative process may not converge to a solution at all. However, for well-conditioned matrices it is possible for iterative solvers to converge to a solution very quickly. Consequently, for the right applications, iterative solvers can be very efficient.

Direct Solvers, on the other hand, often factor the matrix A into the product of two triangular matrices and then perform a forward and backward triangular solve. This approach makes the time required to solve a system of linear equations relatively predictable, based on the size of the matrix. In fact, for sparse matrices, the solution time can be predicted based on the number of non-zero elements in the array A.

Matrix Fundamentals

A matrix is a rectangular array of either real or complex numbers. A matrix is denoted by a capital letter; its elements are denoted by the same lower case letter with row/column subscripts. Thus, the value of the element in row i and column j in matrix A is denoted by a(i,j).
For example, a 3 by 4 matrix A is written as follows:

$$A = \begin{pmatrix} a(1,1) & a(1,2) & a(1,3) & a(1,4) \\ a(2,1) & a(2,2) & a(2,3) & a(2,4) \\ a(3,1) & a(3,2) & a(3,3) & a(3,4) \end{pmatrix}$$

Note that with the above notation, we assume the standard Fortran programming language convention of starting array indices at 1 rather than the C programming language convention of starting them at 0.

A matrix in which all of the elements are real numbers is called a real matrix. A matrix that contains at least one complex number is called a complex matrix. A real or complex matrix A with the property that a(i,j) = a(j,i) is called a symmetric matrix. A complex matrix A with the property that a(i,j) = conj(a(j,i)) is called a Hermitian matrix. Note that programs that manipulate symmetric and Hermitian matrices need only store half of the matrix values, since the values of the non-stored elements can be quickly reconstructed from the stored values.

A matrix that has the same number of rows as it has columns is referred to as a square matrix. The elements in a square matrix that have the same row index and column index are called the diagonal elements of the matrix, or simply the diagonal of the matrix.

The transpose of a matrix A is the matrix obtained by "flipping" the elements of the array about its diagonal. That is, we exchange the elements a(i,j) and a(j,i). For a complex matrix, if we both flip the elements about the diagonal and then take the complex conjugate of the element, the resulting matrix is called the Hermitian transpose or conjugate transpose of the original matrix. The transpose and Hermitian transpose of a matrix A are denoted by $A^T$ and $A^H$ respectively.

A column vector, or simply a vector, is an n × 1 matrix, and a row vector is a 1 × n matrix.

A real or complex matrix A is said to be positive definite if the vector-matrix product $x^T A x$ is greater than zero for all non-zero vectors x. A matrix that is not positive definite is referred to as indefinite.

An upper (or lower) triangular matrix is a square matrix in which all elements below (or above) the diagonal are zero.
A unit triangular matrix is an upper or lower triangular matrix with all 1's along the diagonal.

A matrix P is called a permutation matrix if, for any matrix A, the result of the matrix product PA is identical to A except for interchanging the rows of A. For a square matrix, it can be shown that if PA is a permutation of the rows of A, then $AP^T$ is the same permutation of the columns of A. Additionally, it can be shown that the inverse of P is $P^T$.

In order to save space, a permutation matrix is usually stored as a linear array, called a permutation vector, rather than as an array. Specifically, if the permutation matrix maps the i-th row of a matrix to the j-th row, then the i-th element of the permutation vector is j.

A matrix with non-zero elements only on the diagonal is called a diagonal matrix. As is the case with a permutation matrix, it is usually stored as a vector of values, rather than as a matrix.

Direct Method

For solvers that use the direct method, the basic technique employed in finding the solution of the system Ax = b is to first factor A into triangular matrices. That is, find a lower triangular matrix L and an upper triangular matrix U, such that A = LU. Having obtained such a factorization (usually referred to as an LU decomposition or LU factorization), the solution to the original problem can be rewritten as follows.

Ax = b
LUx = b
L(Ux) = b

This leads to the following two-step process for finding the solution to the original system of equations:
1. Solve the system of equations Ly = b.
2. Solve the system Ux = y.

Solving the systems Ly = b and Ux = y is referred to as a forward solve and a backward solve, respectively.

If a symmetric matrix A is also positive definite, it can be shown that A can be factored as $LL^T$ where L is a lower triangular matrix. Similarly, a Hermitian matrix, A, that is positive definite can be factored as $A = LL^H$. For both symmetric and Hermitian matrices, a factorization of this form is called a Cholesky factorization.
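The two-step process above can be sketched in a few lines. The following Python code is only an illustration and is not part of Intel MKL: the function names are hypothetical, the matrices are dense lists of lists, and the factorization assumes that no pivoting is required (every pivot is non-zero).

```python
def lu_factor(a):
    """Doolittle LU factorization without pivoting: A = L U.

    Illustrative only; assumes all pivots are non-zero.
    L is unit lower triangular, U is upper triangular.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(v) for v in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            factor = upper[i][k] / upper[k][k]
            lower[i][k] = factor
            for j in range(k, n):
                upper[i][j] -= factor * upper[k][j]
    return lower, upper


def forward_solve(lower, b):
    """Step 1: solve L y = b by forward substitution."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        partial = sum(lower[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - partial) / lower[i][i]
    return y


def backward_solve(upper, y):
    """Step 2: solve U x = y by backward substitution."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        partial = sum(upper[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - partial) / upper[i][i]
    return x


def lu_solve(a, b):
    """Solve A x = b as L (U x) = b."""
    lower, upper = lu_factor(a)
    return backward_solve(upper, forward_solve(lower, b))
```

Once L and U are available, each additional right-hand side costs only one forward and one backward solve, which is why direct solvers keep the factorization and solve phases separate.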
In a Cholesky factorization, the matrix U in an LU decomposition is either $L^T$ or $L^H$. Consequently, a solver can increase its efficiency by only storing L, and one-half of A, and not computing U. Therefore, users who can express their application as the solution of a system of positive definite equations will gain a significant performance improvement over using a general representation.

For matrices that are symmetric (or Hermitian) but not positive definite, there are still some significant efficiencies to be had. It can be shown that if A is symmetric but not positive definite, then A can be factored as $A = LDL^T$, where D is a diagonal matrix and L is a lower unit triangular matrix. Similarly, if A is Hermitian, it can be factored as $A = LDL^H$. In either case, we again only need to store L, D, and half of A and we need not compute U. However, the backward solve phases must be amended to solving $L^T x = D^{-1} y$ rather than $L^T x = y$.

Fill-In and Reordering of Sparse Matrices

Two important concepts associated with the solution of sparse systems of equations are fill-in and reordering. The following example illustrates these concepts.

Consider the system of linear equations Ax = b, where A is a symmetric positive definite sparse matrix, and A and b are defined by the following:

A star (*) is used to represent zeros and to emphasize the sparsity of A. The Cholesky factorization of A is: $A = LL^T$, where L is the following:

Notice that even though the matrix A is relatively sparse, the lower triangular matrix L has no zeros below the diagonal. If we computed L and then used it for the forward and backward solve phase, we would do as much computation as if A had been dense.

The situation of L having non-zeros in places where A has zeros is referred to as fill-in. Computationally, it would be more efficient if a solver could exploit the non-zero structure of A in such a way as to reduce the fill-in when computing L.
By doing this, the solver would only need to compute the non-zero entries in L. Toward this end, consider permuting the rows and columns of A. As described in the Matrix Fundamentals section, the permutations of the rows of A can be represented as a permutation matrix, P. The result of permuting the rows is the product of P and A. Suppose, in the above example, we swap the first and fifth rows of A, then swap the first and fifth columns of A, and call the resulting matrix B. Mathematically, we can express the process of permuting the rows and columns of A to get B as $B = PAP^T$. After permuting the rows and columns of A, we see that B is given by the following:

Since B is obtained from A by simply switching rows and columns, the numbers of non-zero entries in A and B are the same. However, when we find the Cholesky factorization, $B = LL^T$, we see the following:

The fill-in associated with B is much smaller than the fill-in associated with A. Consequently, the storage and computation time needed to factor B is much smaller than to factor A. Based on this, we see that an efficient sparse solver needs to find a permutation P of the matrix A, which minimizes the fill-in for factoring $B = PAP^T$, and then use the factorization of B to solve the original system of equations.

Although the above example is based on a symmetric positive definite matrix and a Cholesky decomposition, the same approach works for a general LU decomposition. Specifically, let P be a permutation matrix, $B = PAP^T$ and suppose that B can be factored as B = LU. Then

$Ax = b$
$PA(P^{-1}P)x = Pb$
$PA(P^T P)x = Pb$
$(PAP^T)(Px) = Pb$
$B(Px) = Pb$
$LU(Px) = Pb$

It follows that if we obtain an LU factorization for B, we can solve the original system of equations by a three step process:
1. Solve Ly = Pb.
2. Solve Uz = y.
3. Set $x = P^T z$.
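The effect of reordering on fill-in, and the three-step solve, can be tried on a small case. The sketch below is an illustration in Python, not Intel MKL code. The 5 by 5 "arrow" matrix ARROW is a hypothetical stand-in chosen for this sketch (it is not the matrix of the example in this section), the function names are made up, and the permutation vector is zero-based, with perm[i] = j meaning that row i is moved to row j.

```python
import math


def cholesky(a):
    """Dense Cholesky factorization A = L L^T (A symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        l[j][j] = math.sqrt(a[j][j] - sum(l[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l


def count_nonzeros(l):
    """Number of entries of L that are not exactly zero."""
    return sum(1 for row in l for v in row if v != 0.0)


def permute_symmetric(a, perm):
    """Form B = P A P^T, where P moves row i to row perm[i]."""
    n = len(a)
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            b[perm[i]][perm[j]] = a[i][j]
    return b


def solve_with_reordering(a, rhs, perm):
    """Solve A x = rhs using the Cholesky factor of B = P A P^T."""
    n = len(a)
    l = cholesky(permute_symmetric(a, perm))
    pb = [0.0] * n
    for i in range(n):
        pb[perm[i]] = rhs[i]                      # P b
    y = [0.0] * n
    for i in range(n):                            # step 1: L y = P b
        y[i] = (pb[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):                # step 2: L^T z = y
        s = sum(l[k][i] * z[k] for k in range(i + 1, n))
        z[i] = (y[i] - s) / l[i][i]
    return [z[perm[i]] for i in range(n)]         # step 3: x = P^T z


# Hypothetical test matrix: dense first row and column, diagonal elsewhere.
ARROW = [[4.0, 1.0, 1.0, 1.0, 1.0],
         [1.0, 2.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 2.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 2.0, 0.0],
         [1.0, 0.0, 0.0, 0.0, 2.0]]
SWAP_FIRST_LAST = [4, 1, 2, 3, 0]   # swap rows/columns 1 and 5 (zero-based)
```

For this stand-in matrix, factoring ARROW directly fills the whole lower triangle of L, while factoring the permuted matrix produces no fill-in at all; comparing count_nonzeros for the two factors shows the difference.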
If we apply this three-step process to the current example, we first need to perform the forward solve of the system of equations Ly = Pb:

This gives:

The second step is to perform the backward solve, Uz = y. Or, in this case, since a Cholesky factorization is used, $L^T z = y$.

This gives

The third and final step is to set $x = P^T z$. This gives

Sparse Matrix Storage Formats

As discussed above, it is more efficient to store only the non-zero elements of a sparse matrix. There are a number of common storage formats used for sparse matrices, but most of them employ the same basic technique. That is, store all non-zero elements of the matrix into a linear array and provide auxiliary arrays to describe the locations of the non-zero elements in the original matrix.

Storage Formats for the Direct Sparse Solvers

Storing the non-zero elements of a sparse matrix into a linear array is done by walking down each column (column-major format) or across each row (row-major format) in order, and writing the non-zero elements to a linear array in the order they appear in the walk.

For symmetric matrices, it is necessary to store only the upper triangular half of the matrix (upper triangular format) or the lower triangular half of the matrix (lower triangular format).

The Intel MKL direct sparse solvers use a row-major upper triangular storage format: the matrix is compressed row-by-row and for symmetric matrices only non-zero elements in the upper triangular half of the matrix are stored.

The Intel MKL sparse matrix storage format for direct sparse solvers is specified by three arrays: values, columns, and rowIndex. The following table describes the arrays in terms of the values, row, and column positions of the non-zero elements in a sparse matrix.

values      A real or complex array that contains the non-zero elements of a sparse matrix.
            The non-zero elements are mapped into the values array using the row-major upper triangular storage mapping described above.

columns     Element i of the integer array columns is the number of the column that contains the i-th element in the values array.

rowIndex    Element j of the integer array rowIndex gives the index of the element in the values array that is the first non-zero element in row j.

The length of the values and columns arrays is equal to the number of non-zero elements in the matrix.

As the rowIndex array gives the location of the first non-zero element within a row, and the non-zero elements are stored consecutively, the number of non-zero elements in the i-th row is equal to rowIndex(i+1) - rowIndex(i).

To have this relationship hold for the last row of the matrix, an additional entry (dummy entry) is added to the end of rowIndex. Its value is equal to the number of non-zero elements plus one (for one-based indexing). This makes the total length of the rowIndex array one larger than the number of rows in the matrix.

NOTE The Intel MKL sparse storage scheme for the direct sparse solvers supports both one-based indexing and zero-based indexing.

Consider the symmetric matrix A:

$$A = \begin{pmatrix} 1 & -1 & * & -3 & * \\ -1 & 5 & * & * & * \\ * & * & 4 & 6 & 4 \\ -3 & * & 6 & 7 & * \\ * & * & 4 & * & -5 \end{pmatrix}$$

Only elements from the upper triangle are stored. The actual arrays for the matrix A are as follows:

Storage Arrays for a Symmetric Matrix

one-based indexing
    values   = (1 -1 -3 5 4 6 4 7 -5)
    columns  = (1 2 4 2 3 4 5 4 5)
    rowIndex = (1 4 5 8 9 10)

zero-based indexing
    values   = (1 -1 -3 5 4 6 4 7 -5)
    columns  = (0 1 3 1 2 3 4 3 4)
    rowIndex = (0 3 4 7 8 9)

For a non-symmetric or non-Hermitian matrix, all non-zero elements need to be stored.
Consider the non-symmetric matrix B:

$$B = \begin{pmatrix} 1 & -1 & * & -3 & * \\ -2 & 5 & * & * & * \\ * & * & 4 & 6 & 4 \\ -4 & * & 2 & 7 & * \\ * & 8 & * & * & -5 \end{pmatrix}$$

The matrix B has 13 non-zero elements, and all of them are stored as follows:

Storage Arrays for a Non-Symmetric Matrix

one-based indexing
    values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    columns  = (1 2 4 1 2 3 4 5 1 3 4 2 5)
    rowIndex = (1 4 6 9 12 14)

zero-based indexing
    values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    columns  = (0 1 3 0 1 2 3 4 0 2 3 1 4)
    rowIndex = (0 3 5 8 11 13)

Direct sparse solvers can also solve symmetrically structured systems of equations. A symmetrically structured system of equations is one where the pattern of non-zero elements is symmetric. That is, a matrix has a symmetric structure if a(i,j) is not zero if and only if a(j,i) is not zero. From the point of view of the solver software, a "non-zero" element of a matrix is any element stored in the values array, even if its value is equal to 0. In that sense, any non-symmetric matrix can be turned into a symmetrically structured matrix by carefully adding zeros to the values array. For example, the above matrix B can be turned into a symmetrically structured matrix by adding two explicitly stored zero entries, at positions (2,5) and (5,3):

The matrix B can be considered to be symmetrically structured with 15 non-zero elements and represented as:

Storage Arrays for a Symmetrically Structured Matrix

one-based indexing
    values   = (1 -1 -3 -2 5 0 4 6 4 -4 2 7 8 0 -5)
    columns  = (1 2 4 1 2 5 3 4 5 1 3 4 2 3 5)
    rowIndex = (1 4 7 10 13 16)

zero-based indexing
    values   = (1 -1 -3 -2 5 0 4 6 4 -4 2 7 8 0 -5)
    columns  = (0 1 3 0 1 4 2 3 4 0 2 3 1 2 4)
    rowIndex = (0 3 6 9 12 15)

Storage Format Restrictions

The storage format for the sparse solver must conform to two important restrictions:
- the non-zero values in a given row must be placed into the values array in the order in which they occur in the row (from left to right);
- no diagonal element can be omitted from the values array for any symmetric or structurally symmetric matrix.
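The three storage variants described above (upper triangle of a symmetric matrix, general non-symmetric matrix, and symmetrically structured matrix) can be generated from a dense matrix with one short routine. The Python sketch below is an illustration only: to_three_array and its keyword arguments are hypothetical names, not Intel MKL routines, and the dense matrices MATRIX_A and MATRIX_B are inferred from the storage arrays listed in this section.

```python
def to_three_array(dense, upper_only=False, symmetric_pattern=False, base=1):
    """Build (values, columns, rowIndex) for a square dense matrix.

    upper_only        -- store only the upper triangle (symmetric matrices)
    symmetric_pattern -- also store a(i,j) = 0 explicitly when a(j,i) != 0
    base              -- 1 for one-based indexing, 0 for zero-based indexing
    """
    values, columns, row_index = [], [], [base]
    for i, row in enumerate(dense):
        for j, v in enumerate(row):            # left to right within the row
            if upper_only and j < i:
                continue
            stored = v != 0
            if symmetric_pattern and dense[j][i] != 0:
                stored = True                  # pad the pattern with a zero
            if i == j and (upper_only or symmetric_pattern):
                stored = True                  # diagonal must never be omitted
            if stored:
                values.append(v)
                columns.append(j + base)
        row_index.append(len(values) + base)   # start of the next row
    return values, columns, row_index


# Dense forms inferred from the storage arrays in the text.
MATRIX_A = [[1, -1, 0, -3, 0],
            [-1, 5, 0, 0, 0],
            [0, 0, 4, 6, 4],
            [-3, 0, 6, 7, 0],
            [0, 0, 4, 0, -5]]
MATRIX_B = [[1, -1, 0, -3, 0],
            [-2, 5, 0, 0, 0],
            [0, 0, 4, 6, 4],
            [-4, 0, 2, 7, 0],
            [0, 8, 0, 0, -5]]
```

Calling to_three_array(MATRIX_A, upper_only=True), to_three_array(MATRIX_B), and to_three_array(MATRIX_B, symmetric_pattern=True) corresponds to the three tables above; passing base=0 gives the zero-based variants. Scanning each row from left to right and forcing the diagonal to be stored are how the sketch respects the two storage format restrictions.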
The second restriction implies that if symmetric or structurally symmetric matrices have zero diagonal elements, then they must be explicitly represented in the values array.

Sparse Matrix Storage Formats for Sparse BLAS Level 2 and Level 3

This section describes in detail the sparse matrix storage formats supported in the current version of the Intel MKL Sparse BLAS Level 2 and Level 3.

CSR Format

The Intel MKL compressed sparse row (CSR) format is specified by four arrays: values, columns, pointerB, and pointerE. The following table describes the arrays in terms of the values, row, and column positions of the non-zero elements in a sparse matrix A.

values      A real or complex array that contains the non-zero elements of A. Values of the non-zero elements of A are mapped into the values array using the row-major storage mapping described above.

columns     Element i of the integer array columns is the number of the column in A that contains the i-th value in the values array.

pointerB    Element j of this integer array gives the index of the element in the values array that is the first non-zero element in row j of A. Note that this index is equal to pointerB(j) - pointerB(1) + 1.

pointerE    An integer array that contains row indices, such that pointerE(j) - pointerB(1) is the index of the element in the values array that is the last non-zero element in row j of A.

The length of the values and columns arrays is equal to the number of non-zero elements in A. The length of the pointerB and pointerE arrays is equal to the number of rows in A.

NOTE The Intel MKL Sparse BLAS routines support the CSR format both with one-based indexing and zero-based indexing.
The matrix B can be represented in the CSR format as:

Storage Arrays for a Matrix in CSR Format

one-based indexing
    values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    columns  = (1 2 4 1 2 3 4 5 1 3 4 2 5)
    pointerB = (1 4 6 9 12)
    pointerE = (4 6 9 12 14)

zero-based indexing
    values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    columns  = (0 1 3 0 1 2 3 4 0 2 3 1 4)
    pointerB = (0 3 5 8 11)
    pointerE = (3 5 8 11 13)

This storage format is used in the NIST Sparse BLAS library [Rem05].

Note that the storage format accepted for the direct sparse solvers and described above (see Storage Formats for the Direct Sparse Solvers) is a variation of the CSR format. It is also used in the Intel MKL Sparse BLAS Level 2 both with one-based indexing and zero-based indexing. The above matrix B can be represented in this format (referred to as the 3-array variation of the CSR format) as:

Storage Arrays for a Matrix in CSR Format (3-Array Variation)

one-based indexing
    values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    columns  = (1 2 4 1 2 3 4 5 1 3 4 2 5)
    rowIndex = (1 4 6 9 12 14)

zero-based indexing
    values   = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    columns  = (0 1 3 0 1 2 3 4 0 2 3 1 4)
    rowIndex = (0 3 5 8 11 13)

The 3-array variation of the CSR format has a restriction: all non-zero elements are stored contiguously, that is, the set of non-zero elements in row J comes immediately after the set of non-zero elements in row J-1.

There are no such restrictions in the general (NIST) CSR format. This may be useful, for example, if there is a need to operate with different submatrices of the matrix at the same time. In this case, it is enough to define the arrays pointerB and pointerE for each needed submatrix so that all these arrays are pointers to the same array values.
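To make the role of pointerB and pointerE concrete, the following Python sketch computes the matrix-vector product y = Ax directly from the four CSR arrays. It is an illustration only: csr_matvec is a hypothetical name, not an Intel MKL Sparse BLAS routine, and the code uses Python's zero-based lists to hold arrays that may themselves use either indexing base.

```python
def csr_matvec(values, columns, pointer_b, pointer_e, x, base=1):
    """Compute y = A x for a matrix stored in the four-array CSR format.

    Row j occupies values[pointer_b[j] - pointer_b[0] : pointer_e[j] - pointer_b[0]]
    (Python slice notation, end exclusive), which matches the index rules
    given in the table above. `base` is the indexing base of `columns`.
    """
    origin = pointer_b[0]
    y = []
    for begin, end in zip(pointer_b, pointer_e):
        acc = 0
        for k in range(begin - origin, end - origin):
            acc += values[k] * x[columns[k] - base]
        y.append(acc)
    return y
```

With the one-based arrays for the matrix B listed above and x = (1 1 1 1 1), each entry of the result is simply the sum of the stored values in the corresponding row.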
Comparing the array rowIndex from the Table "Storage Arrays for a Non-Symmetric Example Matrix" with the arrays pointerB and pointerE from the Table "Storage Arrays for an Example Matrix in CSR Format", it is easy to see that

    pointerB(i) = rowIndex(i) for i = 1, ..., 5;
    pointerE(i) = rowIndex(i+1) for i = 1, ..., 5.

This enables calling a routine that has values, columns, pointerB and pointerE as input parameters for a sparse matrix stored in the format accepted for the direct sparse solvers. For example, a routine with the interface:

    Subroutine name_routine(.... , values, columns, pointerB, pointerE, ...)

can be called with parameters values, columns, rowIndex as follows:

    call name_routine(.... , values, columns, rowIndex, rowIndex(2), ...).

CSC Format

The compressed sparse column format (CSC) is similar to the CSR format, but the columns are used instead of the rows. In other words, the CSC format is identical to the CSR format for the transposed matrix. The CSC format is specified by four arrays: values, rows, pointerB, and pointerE. The following table describes the arrays in terms of the values, row, and column positions of the non-zero elements in a sparse matrix A.

values
    A real or complex array that contains the non-zero elements of A. Values of the non-zero elements of A are mapped into the values array using the column-major storage mapping.
rows
    Element i of the integer array rows is the number of the row in A that contains the i-th value in the values array.
pointerB
    Element j of this integer array gives the index of the element in the values array that is the first non-zero element in column j of A. Note that this index is equal to pointerB(j) - pointerB(1) + 1.
pointerE
    An integer array that contains column indices, such that pointerE(j) - pointerB(1) is the index of the element in the values array that is the last non-zero element in column j of A.
The length of the values and rows arrays is equal to the number of non-zero elements in A. The length of the pointerB and pointerE arrays is equal to the number of columns in A.

NOTE The Intel MKL Sparse BLAS routines support the CSC format both with one-based indexing and zero-based indexing.

The above matrix B can be represented in the CSC format as:

Storage Arrays for a Matrix in CSC Format

one-based indexing
    values   = (1 -2 -4 -1 5 8 4 2 -3 6 7 4 -5)
    rows     = (1 2 4 1 2 5 3 4 1 3 4 3 5)
    pointerB = (1 4 7 9 12)
    pointerE = (4 7 9 12 14)
zero-based indexing
    values   = (1 -2 -4 -1 5 8 4 2 -3 6 7 4 -5)
    rows     = (0 1 3 0 1 4 2 3 0 2 3 2 4)
    pointerB = (0 3 6 8 11)
    pointerE = (3 6 8 11 13)

Coordinate Format

The coordinate format is the most flexible and simplest format for the sparse matrix representation. Only non-zero elements are stored, and the coordinates of each non-zero element are given explicitly. Many commercial libraries support the matrix-vector multiplication for the sparse matrices in the coordinate format.

The Intel MKL coordinate format is specified by three arrays: values, rows, and columns, and a parameter nnz, which is the number of non-zero elements in A. All three arrays have dimension nnz. The following table describes the arrays in terms of the values, row, and column positions of the non-zero elements in a sparse matrix A.

values
    A real or complex array that contains the non-zero elements of A in any order.
rows
    Element i of the integer array rows is the number of the row in A that contains the i-th value in the values array.
columns
    Element i of the integer array columns is the number of the column in A that contains the i-th value in the values array.

NOTE The Intel MKL Sparse BLAS routines support the coordinate format both with one-based indexing and zero-based indexing.
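Because every non-zero element carries its own row and column number, a matrix-vector product in the coordinate format reduces to a single loop over the nnz triples, in whatever order they are stored. The following minimal C sketch shows this for the zero-based coordinate arrays of the example matrix C used in this section. The function name coo_matvec and the C_* array names are hypothetical and are not part of Intel MKL.

```c
/* Zero-based coordinate arrays of the 5x5 example matrix C. */
static const double C_values[13]  = {1, -1, -3, -2, 5, 4, 6, 4, -4, 2, 7, 8, -5};
static const int    C_rows[13]    = {0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4};
static const int    C_columns[13] = {0, 1, 2, 0, 1, 2, 3, 4, 0, 2, 3, 1, 4};

/* y = A*x for an m-row matrix given as nnz (value, row, column) triples.
 * The triples may appear in any order. Illustrative helper only. */
void coo_matvec(int m, int nnz, const double *values,
                const int *rows, const int *columns,
                const double *x, double *y)
{
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;
    for (int k = 0; k < nnz; ++k)
        y[rows[k]] += values[k] * x[columns[k]];
}
```

For x = (1, 2, 3, 4, 5), the call coo_matvec(5, 13, C_values, C_rows, C_columns, x, y) gives y = (-10, 8, 56, 30, -9).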
For example, the sparse matrix C can be represented in the coordinate format as follows:

Storage Arrays for an Example Matrix in the Coordinate Format

one-based indexing
    values  = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    rows    = (1 1 1 2 2 3 3 3 4 4 4 5 5)
    columns = (1 2 3 1 2 3 4 5 1 3 4 2 5)
zero-based indexing
    values  = (1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5)
    rows    = (0 0 0 1 1 2 2 2 3 3 3 4 4)
    columns = (0 1 2 0 1 2 3 4 0 2 3 1 4)

Diagonal Storage Format

If the sparse matrix has diagonals containing only zero elements, then the diagonal storage format can be used to reduce the amount of information needed to locate the non-zero elements. This storage format is particularly useful in many applications where the matrix arises from a finite element or finite difference discretization.

The Intel MKL diagonal storage format is specified by two arrays: values and distance, and two parameters: ndiag, which is the number of non-empty diagonals, and lval, which is the declared leading dimension in the calling (sub)programs. The following table describes the arrays values and distance:

values
    A real or complex two-dimensional array dimensioned as lval by ndiag. Each column of it contains the non-zero elements of a certain diagonal of A. The key point of the storage is that each element in values retains the row number of the original matrix. To achieve this, diagonals in the lower triangular part of the matrix are padded from the top, and those in the upper triangular part are padded from the bottom. Note that the absolute value of distance(i) is the number of elements to be padded for diagonal i.
distance
    An integer array with dimension ndiag. Element i of the array distance is the distance between the i-th diagonal and the main diagonal. The distance is positive if the diagonal is above the main diagonal, and negative if the diagonal is below the main diagonal. The main diagonal has a distance equal to zero.
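The rule that each stored element keeps its original row number can be made concrete with a short C sketch. It fills the values array for the example matrix C from its zero-based coordinate arrays and then uses the result in a matrix-vector product. The function names coo_to_dia and dia_matvec and the C_* arrays are hypothetical, not MKL routines; the sketch assumes a square matrix, a column-major values array as in Fortran, and padded slots set to zero.

```c
/* Zero-based coordinate arrays of the 5x5 example matrix C. */
static const double C_val[13]  = {1, -1, -3, -2, 5, 4, 6, 4, -4, 2, 7, 8, -5};
static const int    C_rows[13] = {0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4};
static const int    C_cols[13] = {0, 1, 2, 0, 1, 2, 3, 4, 0, 2, 3, 1, 4};
/* Distances (column minus row) of the non-empty diagonals of C. */
static const int    C_distance[5] = {-3, -1, 0, 1, 2};

/* Fill the lval-by-ndiag array values (column-major) so that
 * values[i + lval*d] holds A(i, i + distance[d]); padded slots are set to 0.
 * Returns 0 on success, -1 if an element lies on a diagonal not listed. */
int coo_to_dia(int nnz, const double *val, const int *rows, const int *cols,
               int lval, int ndiag, const int *distance, double *values)
{
    for (int k = 0; k < lval * ndiag; ++k)
        values[k] = 0.0;
    for (int k = 0; k < nnz; ++k) {
        int d = 0;
        while (d < ndiag && distance[d] != cols[k] - rows[k])
            ++d;
        if (d == ndiag)
            return -1;
        values[rows[k] + lval * d] = val[k];  /* element keeps its row number */
    }
    return 0;
}

/* y = A*x for a square n-by-n matrix in diagonal storage. */
void dia_matvec(int n, int lval, int ndiag, const int *distance,
                const double *values, const double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = 0.0;
    for (int d = 0; d < ndiag; ++d) {
        for (int i = 0; i < n; ++i) {
            int j = i + distance[d];          /* column of element (i, j) */
            if (j >= 0 && j < n)              /* skip padded positions */
                y[i] += values[i + lval * d] * x[j];
        }
    }
}
```

With lval = 5 and ndiag = 5, the first column of values (distance -3) holds three padded slots followed by -4 and 8, and the third column holds the main diagonal (1, 5, 4, 7, -5).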
The above matrix C can be represented in the diagonal storage format as follows:

    distance = (-3 -1 0 1 2)

    values =   *   *   1  -1  -3
               *  -2   5   0   0
               *   0   4   6   4
              -4   2   7   0   *
               8   0  -5   *   *

where the asterisks denote padded elements.

When storing symmetric, Hermitian, or skew-symmetric matrices, it is necessary to store only the upper or the lower triangular part of the matrix.

For the Intel MKL triangular solver routines, elements of the array distance must be sorted in increasing order. In all other cases the diagonals and distances can be stored in arbitrary order.

Skyline Storage Format

The skyline storage format is important for the direct sparse solvers, and it is well suited for Cholesky or LU decomposition when no pivoting is required.

The skyline storage format accepted in Intel MKL can store only a triangular matrix or the triangular part of a matrix. This format is specified by two arrays: values and pointers. The following table describes these arrays:

values
    A scalar array. For a lower triangular matrix it contains the set of elements from each row of the matrix starting from the first non-zero element to and including the diagonal element. For an upper triangular matrix it contains the set of elements from each column of the matrix starting with the first non-zero element down to and including the diagonal element. Encountered zero elements are included in the sets.
pointers
    An integer array with dimension (m+1), where m is the number of rows for the lower triangle (columns for the upper triangle). pointers(i) - pointers(1) + 1 gives the index of the element in values that is the first non-zero element in row (column) i. The value of pointers(m+1) is set to nnz + pointers(1), where nnz is the number of elements in the array values.
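The pointer arithmetic of the skyline format can be sketched in C as an element lookup for a lower triangular matrix. The example data are the lower triangle of matrix C, with the pointers array shifted to zero-based values; that shift, the function name sky_lower_get, and the S_* array names are assumptions made for this illustration and are not part of Intel MKL.

```c
/* Lower triangle of the example matrix C in skyline format.
 * pointers is the one-based array (1 2 4 5 9 13) shifted to zero-based. */
static const double S_values[12]  = {1, -2, 5, 4, -4, 0, 2, 7, 8, 0, 0, -5};
static const int    S_pointers[6] = {0, 1, 3, 4, 8, 12};

/* Return A(i,j) (zero-based i, j) of a lower triangular matrix in skyline
 * format. Row i stores the elements from its first non-zero column up to and
 * including the diagonal, so the first stored column follows from the row
 * length. Elements outside the stored profile are zero. */
double sky_lower_get(const double *values, const int *pointers, int i, int j)
{
    int len = pointers[i + 1] - pointers[i];  /* stored elements in row i */
    int first_col = i - (len - 1);            /* column of the first one  */
    if (j > i || j < first_col)
        return 0.0;
    return values[pointers[i] + (j - first_col)];
}
```

For example, sky_lower_get(S_values, S_pointers, 3, 1) returns the explicitly stored zero of the fourth row, while sky_lower_get(S_values, S_pointers, 2, 0) returns zero because that position lies outside the stored profile.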
For example, the lower triangle of the matrix C given above can be stored as follows:

    values   = ( 1 -2 5 4 -4 0 2 7 8 0 0 -5 )
    pointers = ( 1 2 4 5 9 13 )

and the upper triangle of this matrix C can be stored as follows:

    values   = ( 1 -1 5 -3 0 4 6 7 4 0 -5 )
    pointers = ( 1 2 4 7 9 12 )

This storage format is supported by the NIST Sparse BLAS library [Rem05].

Note that the Intel MKL Sparse BLAS routines operating with the skyline storage format do not support general matrices.

BSR Format

The Intel MKL block compressed sparse row (BSR) format for sparse matrices is specified by four arrays: values, columns, pointerB, and pointerE. The following table describes these arrays.

values
    A real array that contains the elements of the non-zero blocks of a sparse matrix. The elements are stored block-by-block in row-major order. A non-zero block is a block that contains at least one non-zero element. All elements of non-zero blocks are stored, even if some of them are equal to zero. Within each non-zero block, elements are stored in column-major order in the case of one-based indexing, and in row-major order in the case of zero-based indexing.
columns
    Element i of the integer array columns is the number of the column in the block matrix that contains the i-th non-zero block.
pointerB
    Element j of this integer array gives the index of the element in the columns array that is the first non-zero block in row j of the block matrix.
pointerE
    Element j of this integer array gives the index of the element in the columns array that contains the last non-zero block in row j of the block matrix, plus 1.

The length of the values array is equal to the number of all elements in the non-zero blocks; the length of the columns array is equal to the number of non-zero blocks. The length of the pointerB and pointerE arrays is equal to the number of block rows in the block matrix.
NOTE The Intel MKL Sparse BLAS routines support the BSR format both with one-based indexing and zero-based indexing.

For example, consider the sparse matrix D:

    [Figure: matrix D]

If the size of the block equals 2, then the sparse matrix D can be represented as a 3x3 block matrix E with the following structure:

    [Figure: block structure of matrix E and the definitions of its blocks]

The matrix D can be represented in the BSR format as follows:

one-based indexing
    values   = (1 2 0 1 6 8 7 2 1 5 4 1 4 0 3 0 7 0 2 0)
    columns  = (1 2 2 2 3)
    pointerB = (1 3 4)
    pointerE = (3 4 6)
zero-based indexing
    values   = (1 0 2 1 6 7 8 2 1 4 5 1 4 3 0 0 7 2 0 0)
    columns  = (0 1 1 1 2)
    pointerB = (0 2 3)
    pointerE = (2 3 5)

This storage format is supported by the NIST Sparse BLAS library [Rem05].

Intel MKL supports the variation of the BSR format that is specified by three arrays: values, columns, and rowIndex. The following table describes these arrays.

values
    A real array that contains the elements of the non-zero blocks of a sparse matrix. The elements are stored block by block in row-major order. A non-zero block is a block that contains at least one non-zero element. All elements of non-zero blocks are stored, even if some of them are equal to zero. Within each non-zero block, the elements are stored in column-major order in the case of one-based indexing, and in row-major order in the case of zero-based indexing.
columns
    Element i of the integer array columns is the number of the column in the block matrix that contains the i-th non-zero block.
rowIndex
    Element j of this integer array gives the index of the element in the columns array that is the first non-zero block in row j of the block matrix.

The length of the values array is equal to the number of all elements in the non-zero blocks; the length of the columns array is equal to the number of non-zero blocks.
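The way the BSR arrays address whole blocks rather than single elements can be sketched with a C matrix-vector product over the zero-based four-array representation of matrix D given above (block size 2, row-major order inside each block). The function name bsr_matvec and the D_* array names are hypothetical and are not part of Intel MKL.

```c
#include <stddef.h>

/* Zero-based BSR arrays of the 6x6 example matrix D, block size 2. */
static const double D_values[20] = {1, 0, 2, 1,  6, 7, 8, 2,  1, 4, 5, 1,
                                    4, 3, 0, 0,  7, 2, 0, 0};
static const int D_columns[5]  = {0, 1, 1, 1, 2};
static const int D_pointerB[3] = {0, 2, 3};
static const int D_pointerE[3] = {2, 3, 5};

/* y = A*x for a matrix with mb block rows of lb-by-lb blocks in zero-based
 * BSR format. pointerB/pointerE index the columns array, and block k
 * occupies values[k*lb*lb] .. values[(k+1)*lb*lb - 1] in row-major order. */
void bsr_matvec(int mb, int lb, const double *values, const int *columns,
                const int *pointerB, const int *pointerE,
                const double *x, double *y)
{
    for (int i = 0; i < mb * lb; ++i)
        y[i] = 0.0;
    for (int bi = 0; bi < mb; ++bi) {
        for (int k = pointerB[bi]; k < pointerE[bi]; ++k) {
            const double *blk = values + (size_t)k * lb * lb;
            int bj = columns[k];              /* block column of block k */
            for (int r = 0; r < lb; ++r)
                for (int c = 0; c < lb; ++c)
                    y[bi * lb + r] += blk[r * lb + c] * x[bj * lb + c];
        }
    }
}
```

With x set to all ones, bsr_matvec(3, 2, D_values, D_columns, D_pointerB, D_pointerE, x, y) yields the row sums of D, (14, 13, 5, 6, 16, 0); the last entry is zero because the stored blocks in the third block row have an all-zero second row.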
As the rowIndex array gives the location of the first non-zero block within a row, and the non-zero blocks are stored consecutively, the number of non-zero blocks in the i-th row is equal to rowIndex(i+1) - rowIndex(i).

To retain this relationship for the last row of the block matrix, an additional entry (dummy entry) is added to the end of rowIndex with a value equal to the number of non-zero blocks plus one. This makes the total length of the rowIndex array one larger than the number of rows of the block matrix.

The above matrix D can be represented in this 3-array variation of the BSR format as follows:

one-based indexing
    values   = (1 2 0 1 6 8 7 2 1 5 4 1 4 0 3 0 7 0 2 0)
    columns  = (1 2 2 2 3)
    rowIndex = (1 3 4 6)
zero-based indexing
    values   = (1 0 2 1 6 7 8 2 1 4 5 1 4 3 0 0 7 2 0 0)
    columns  = (0 1 1 1 2)
    rowIndex = (0 2 3 5)

When storing symmetric matrices, it is necessary to store only the upper or the lower triangular part of the matrix.

For example, consider the symmetric sparse matrix F:

    [Figure: matrix F]

If the size of the block equals 2, then the sparse matrix F can be represented as a 3x3 block matrix G with the following structure:

    [Figure: block structure of matrix G and the definitions of its blocks]

The symmetric matrix F can be represented in this 3-array variation of the BSR format (storing only the upper triangular part) as follows:

one-based indexing
    values   = (1 2 0 1 6 8 7 2 1 5 4 2 7 0 2 0)
    columns  = (1 2 2 3)
    rowIndex = (1 3 4 5)
zero-based indexing
    values   = (1 0 2 1 6 7 8 2 1 4 5 2 7 2 0 0)
    columns  = (0 1 1 2)
    rowIndex = (0 2 3 4)

Appendix B: Routine and Function Arguments

The major arguments in the BLAS routines are vector and matrix, whereas VML functions work on vector arguments only. The sections that follow discuss each of these arguments and provide examples.

Vector Arguments in BLAS

Vector arguments are passed in one-dimensional arrays. The array dimension (length) and vector increment are passed as integer variables.
The length determines the number of elements in the vector. The increment (also called stride) determines the spacing between vector elements and the order of the elements in the array in which the vector is passed.

A vector of length n and increment incx is passed in a one-dimensional array x whose values are defined as

    x(1), x(1+|incx|), ..., x(1+(n-1)*|incx|)

If incx is positive, then the elements in array x are stored in increasing order. If incx is negative, the elements in array x are stored in decreasing order with the first element defined as x(1+(n-1)*|incx|). If incx is zero, then all elements of the vector have the same value, x(1). The dimension of the one-dimensional array that stores the vector must always be at least

    idimx = 1 + (n-1)*|incx|

Example. One-dimensional Real Array

Let x(1:7) be the one-dimensional real array

    x = (1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0).

If incx = 2 and n = 3, then the vector argument with elements in order from first to last is (1.0, 5.0, 9.0).

If incx = -2 and n = 4, then the vector elements in order from first to last are (13.0, 9.0, 5.0, 1.0).

If incx = 0 and n = 4, then the vector elements in order from first to last are (1.0, 1.0, 1.0, 1.0).

One-dimensional substructures of a matrix, such as the rows, columns, and diagonals, can be passed as vector arguments with the starting address and increment specified. In Fortran, storing the m-by-n matrix is based on column-major ordering where the increment between elements in the same column is 1, the increment between elements in the same row is m, and the increment between elements on the same diagonal is m + 1.

Example. Two-dimensional Real Matrix

Let a be the real 5 x 4 matrix declared as REAL A (5,4).
To scale the third column of a by 2.0, use the BLAS routine sscal with the following calling sequence:

    call sscal (5, 2.0, a(1,3), 1)

To scale the second row, use the statement:

    call sscal (4, 2.0, a(2,1), 5)

To scale the main diagonal of A by 2.0, use the statement:

    call sscal (4, 2.0, a(1,1), 6)

NOTE The default vector argument is assumed to be 1.

Vector Arguments in VML

Vector arguments of VML mathematical functions are passed in one-dimensional arrays with unit vector increment. It means that a vector of length n is passed contiguously in an array a whose values are defined as a[0], a[1], ..., a[n-1] (for the C interface).

To accommodate arrays with other increments, or more complicated indexing, VML contains auxiliary pack/unpack functions that gather the array elements into a contiguous vector and then scatter them after the computation is complete.

Generally, if the vector elements are stored in a one-dimensional array a as a[m_0], a[m_1], ..., a[m_(n-1)] and need to be regrouped into an array y as y[k_0], y[k_1], ..., y[k_(n-1)], VML pack/unpack functions can use one of the following indexing methods:

Positive Increment Indexing

    k_j = incy * j, m_j = inca * j, j = 0, ..., n-1

Constraint: incy > 0 and inca > 0.

For example, setting incy = 1 specifies gathering array elements into a contiguous vector. This method is similar to that used in BLAS, with the exception that negative and zero increments are not permitted.

Index Vector Indexing

    k_j = iy[j], m_j = ia[j], j = 0, ..., n-1,

where ia and iy are arrays of length n that contain index vectors for the input and output arrays a and y, respectively.

Mask Vector Indexing

Indices k_j, m_j are such that:

    my[k_j] ≠ 0, ma[m_j] ≠ 0, j = 0, ..., n-1,

where ma and my are arrays that contain mask vectors for the input and output arrays a and y, respectively.
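The three indexing methods can be pictured as plain gather loops. The following C sketch packs selected elements of an array a into a contiguous vector y (that is, with incy = 1 on the output side). The function names pack_increment, pack_index, and pack_mask are hypothetical stand-ins written for this illustration; they are not the VML pack functions and make no claim about the actual VML signatures.

```c
/* Positive increment indexing: y[j] = a[inca * j], with inca > 0. */
void pack_increment(int n, const double *a, int inca, double *y)
{
    for (int j = 0; j < n; ++j)
        y[j] = a[inca * j];
}

/* Index vector indexing: y[j] = a[ia[j]]. */
void pack_index(int n, const double *a, const int *ia, double *y)
{
    for (int j = 0; j < n; ++j)
        y[j] = a[ia[j]];
}

/* Mask vector indexing: copy a[i] for every i < len whose mask entry ma[i]
 * is non-zero. Returns the number of elements packed into y. */
int pack_mask(int len, const double *a, const int *ma, double *y)
{
    int n = 0;
    for (int i = 0; i < len; ++i)
        if (ma[i] != 0)
            y[n++] = a[i];
    return n;
}
```

For a = (10, 11, 12, 13, 14, 15, 16), an increment of 2 with n = 3 packs (10, 12, 14), the index vector (6, 0, 3) packs (16, 10, 13), and the mask (0, 1, 0, 0, 1, 1, 0) packs the three elements (11, 14, 15).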
Matrix Arguments

Matrix arguments of the Intel® Math Kernel Library routines can be stored in either one- or two-dimensional arrays, using the following storage schemes:

• conventional full storage (in a two-dimensional array)
• packed storage for Hermitian, symmetric, or triangular matrices (in a one-dimensional array)
• band storage for band matrices (in a two-dimensional array)
• rectangular full packed storage for symmetric, Hermitian, or triangular matrices, which is as compact as the packed storage while maintaining efficiency by using Level 3 BLAS/LAPACK kernels.

Full storage is the following obvious scheme: a matrix A is stored in a two-dimensional array a, with the matrix element aij stored in the array element a(i,j).

If a matrix is triangular (upper or lower, as specified by the argument uplo), only the elements of the relevant triangle are stored; the remaining elements of the array need not be set.

Routines that handle symmetric or Hermitian matrices allow for either the upper or lower triangle of the matrix to be stored in the corresponding elements of the array:

    if uplo = 'U', aij is stored in a(i,j) for i ≤ j, other elements of a need not be set.
    if uplo = 'L', aij is stored in a(i,j) for j ≤ i, other elements of a need not be set.

Packed storage allows you to store symmetric, Hermitian, or triangular matrices more compactly: the relevant triangle (again, as specified by the argument uplo) is packed by columns in a one-dimensional array ap:

    if uplo = 'U', aij is stored in ap(i+j(j-1)/2) for i ≤ j
    if uplo = 'L', aij is stored in ap(i+(2*n-j)*(j-1)/2) for j ≤ i.

In descriptions of LAPACK routines, arrays with packed matrices have names ending in p.

Band storage is as follows: an m-by-n band matrix with kl non-zero sub-diagonals and ku non-zero super-diagonals is stored compactly in a two-dimensional array ab with kl+ku+1 rows and n columns.
Columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array. Thus,

    aij is stored in ab(ku+1+i-j,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl).

Use the band storage scheme only when kl and ku are much less than the matrix size n. Although the routines work correctly for all values of kl and ku, using the band storage is inefficient if your matrices are not really banded.

The band storage scheme is illustrated by the following example, when m = n = 6, kl = 2, ku = 1. Array elements marked * are not used by the routines:

    [Figure: band matrix A and its band storage in array ab]

When a general band matrix is supplied for LU factorization, space must be allowed to store kl additional super-diagonals generated by fill-in as a result of row interchanges. This means that the matrix is stored according to the above scheme, but with kl + ku super-diagonals. Thus,

    aij is stored in ab(kl+ku+1+i-j,j) for max(1,j-ku) ≤ i ≤ min(n,j+kl).

The band storage scheme for LU factorization is illustrated by the following example, when m = n = 6, kl = 2, ku = 1:

    [Figure: band matrix A and its band storage in array ab for LU factorization]

Array elements marked * are not used by the routines; elements marked + need not be set on entry, but are required by the LU factorization routines to store the results. The input array will be overwritten on exit by the details of the LU factorization as follows:

    [Figure: array ab on exit from the LU factorization]

where uij are the elements of the upper triangular matrix U, and mij are the multipliers used during factorization.

Triangular band matrices are stored in the same format, with either kl = 0 if upper triangular, or ku = 0 if lower triangular. For symmetric or Hermitian band matrices with k sub-diagonals or super-diagonals, you need to store only the upper or lower triangle, as specified by the argument uplo:

    if uplo = 'U', aij is stored in ab(k+1+i-j,j) for max(1,j-k) ≤ i ≤ j
    if uplo = 'L', aij is stored in ab(1+i-j,j) for j ≤ i ≤ min(n,j+k).
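The index formulas for packed and band storage can be checked with a few small C helpers. Each one takes the one-based matrix indices i and j used in the formulas above and returns the zero-based offset of aij in the storage array, assuming column-major layout of ab with leading dimension ldab. The function names are hypothetical and were chosen only for this sketch.

```c
/* Offset of a(i,j), i <= j, in ap for uplo = 'U': ap(i + j(j-1)/2). */
int packed_upper_index(int i, int j)
{
    return i + j * (j - 1) / 2 - 1;
}

/* Offset of a(i,j), j <= i, in ap for uplo = 'L': ap(i + (2n-j)(j-1)/2). */
int packed_lower_index(int n, int i, int j)
{
    return i + (2 * n - j) * (j - 1) / 2 - 1;
}

/* Offset of a(i,j) in the band array ab: element ab(ku+1+i-j, j) of a
 * column-major array with ldab rows (ldab >= kl+ku+1). */
int band_index(int ldab, int ku, int i, int j)
{
    return (ku + i - j) + (j - 1) * ldab;
}
```

For n = 3, the upper triangle is packed in the order a11, a12, a22, a13, a23, a33 and the lower triangle in the order a11, a21, a31, a22, a32, a33, so for instance packed_upper_index(2, 3) is 4 and packed_lower_index(3, 2, 2) is 3. For the band example with kl = 2 and ku = 1 (ldab = 4), a12 lands in the first row of the second column of ab, which is offset 4.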
In descriptions of LAPACK routines, arrays that hold matrices in band storage have names ending in b.

In Fortran, column-major ordering of storage is assumed. This means that elements of the same column occupy successive storage locations.

Three quantities are usually associated with a two-dimensional array argument: its leading dimension, which specifies the number of storage locations between elements in the same row, its number of rows, and its number of columns. For a matrix in full storage, the leading dimension of the array must be at least as large as the number of rows in the matrix.

A character transposition parameter is often passed to indicate whether the matrix argument is to be used in normal or transposed form or, for a complex matrix, if the conjugate transpose of the matrix is to be used.

The values of the transposition parameter for these three cases are the following:

    'N' or 'n'    normal (no conjugation, no transposition)
    'T' or 't'    transpose
    'C' or 'c'    conjugate transpose.

Example. Two-Dimensional Complex Array

Suppose A (1:5, 1:4) is the complex two-dimensional array presented by the matrix

    [Figure: the 5-by-4 complex matrix A]

Let transa be the transposition parameter, m be the number of rows, n be the number of columns, and lda be the leading dimension. Then if transa = 'N', m = 4, n = 2, and lda = 5, the matrix argument would be

    [Figure: matrix argument for transa = 'N']

If transa = 'T', m = 4, n = 2, and lda = 5, the matrix argument would be

    [Figure: matrix argument for transa = 'T']

If transa = 'C', m = 4, n = 2, and lda = 5, the matrix argument would be

    [Figure: matrix argument for transa = 'C']

Note that care should be taken when using a leading dimension value which is different from the number of rows specified in the declaration of the two-dimensional array. For example, suppose the array A above is declared as COMPLEX A (5,4).
Then if transa = 'N', m = 3, n = 4, and lda = 4, the matrix argument will be

    [Figure: matrix argument for lda = 4]

Rectangular Full Packed storage allows you to store symmetric, Hermitian, or triangular matrices as compactly as the Packed storage while maintaining efficiency by using Level 3 BLAS/LAPACK kernels. To store an n-by-n triangle (and suppose for simplicity that n is even), you partition the triangle into three parts: two n/2-by-n/2 triangles and an n/2-by-n/2 square, then pack this as an n-by-n/2 rectangle (or n/2-by-n rectangle), by transposing (or transpose-conjugating) one of the triangles and packing it next to the other triangle. Since the two triangles are stored in full storage, you can use existing efficient routines on them.

There are eight cases of RFP storage representation: n is even or odd, the packed matrix is transposed or not, and the triangular matrix is lower or upper. All eight storage schemes are illustrated below:

    [Figure: n is odd, A is lower triangular]
    [Figure: n is even, A is lower triangular]
    [Figure: n is odd, A is upper triangular]
    [Figure: n is even, A is upper triangular]

Intel MKL provides a number of routines, such as ?hfrk and ?sfrk, that perform BLAS operations directly on RFP matrices, as well as some conversion routines: for instance, ?tpttf goes from the standard packed format to RFP and ?trttf goes from the full format to RFP.

Please refer to the Netlib site for more information.

Note that in the descriptions of LAPACK routines, arrays with RFP matrices have names ending in fp.

Appendix C: Code Examples

This appendix presents code examples of using some Intel MKL routines and functions. You can find here example code written in both Fortran and C.

Please refer to respective chapters in the manual for detailed descriptions of function parameters and operation.

BLAS Code Examples

Example. Using BLAS Level 1 Function

The following example illustrates a call to the BLAS Level 1 function sdot. This function performs a vector-vector operation of computing a scalar product of two single-precision real vectors x and y.

Parameters

    n       Specifies the number of elements in vectors x and y.
    incx    Specifies the increment for the elements of x.
    incy    Specifies the increment for the elements of y.

      program dot_main
      real x(10), y(10), sdot, res
      integer n, incx, incy, i
      external sdot
      n = 5
      incx = 2
      incy = 1
      do i = 1, 10
        x(i) = 2.0e0
        y(i) = 1.0e0
      end do
      res = sdot (n, x, incx, y, incy)
      print*, 'SDOT = ', res
      end

As a result of this program execution, the following line is printed:

    SDOT = 10.000

Example. Using BLAS Level 1 Routine

The following example illustrates a call to the BLAS Level 1 routine scopy. This routine performs a vector-vector operation of copying a single-precision real vector x to a vector y.

Parameters

    n       Specifies the number of elements in vectors x and y.
    incx    Specifies the increment for the elements of x.
    incy    Specifies the increment for the elements of y.

      program copy_main
      real x(10), y(10)
      integer n, incx, incy, i
      n = 3
      incx = 3
      incy = 1
      do i = 1, 10
        x(i) = i
      end do
      call scopy (n, x, incx, y, incy)
      print*, 'Y = ', (y(i), i = 1, n)
      end

As a result of this program execution, the following line is printed:

    Y = 1.00000 4.00000 7.00000

Example. Using BLAS Level 2 Routine

The following example illustrates a call to the BLAS Level 2 routine sger. This routine performs a matrix-vector operation

    a := alpha*x*y' + a.

Parameters

    alpha   Specifies a scalar alpha.
    x       m-element vector.
    y       n-element vector.
    a       m-by-n matrix.
      program ger_main
      real a(5,3), x(10), y(10), alpha
      integer m, n, incx, incy, i, j, lda
      m = 2
      n = 3
      lda = 5
      incx = 2
      incy = 1
      alpha = 0.5
      do i = 1, 10
        x(i) = 1.0
        y(i) = 1.0
      end do
      do i = 1, m
        do j = 1, n
          a(i,j) = j
        end do
      end do
      call sger (m, n, alpha, x, incx, y, incy, a, lda)
      print*, 'Matrix A: '
      do i = 1, m
        print*, (a(i,j), j = 1, n)
      end do
      end

As a result of this program execution, matrix a is printed as follows:

    Matrix A:
    1.50000 2.50000 3.50000
    1.50000 2.50000 3.50000

Example. Using BLAS Level 3 Routine

The following example illustrates a call to the BLAS Level 3 routine ssymm. This routine performs a matrix-matrix operation

    c := alpha*a*b + beta*c.

Parameters

    alpha   Specifies a scalar alpha.
    beta    Specifies a scalar beta.
    a       Symmetric matrix
    b       m-by-n matrix
    c       m-by-n matrix

      program symm_main
      real a(3,3), b(3,2), c(3,3), alpha, beta
      integer m, n, lda, ldb, ldc, i, j
      character uplo, side
      uplo = 'u'
      side = 'l'
      m = 3
      n = 2
      lda = 3
      ldb = 3
      ldc = 3
      alpha = 0.5
      beta = 2.0
      do i = 1, m
        do j = 1, m
          a(i,j) = 1.0
        end do
      end do
      do i = 1, m
        do j = 1, n
          c(i,j) = 1.0
          b(i,j) = 2.0
        end do
      end do
      call ssymm (side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
      print*, 'Matrix C: '
      do i = 1, m
        print*, (c(i,j), j = 1, n)
      end do
      end

As a result of this program execution, matrix c is printed as follows:

    Matrix C:
    5.00000 5.00000
    5.00000 5.00000
    5.00000 5.00000

The following example illustrates a call from a C program to the complex BLAS Level 1 function zdotc(). This function computes the dot product of two double-precision complex vectors.

Example. Calling a Complex BLAS Level 1 Function from C

In this example, the complex dot product is returned in the structure c.
    #include <stdio.h>
    #include "mkl_blas.h"
    #define N 5
    int main(void)
    {
        int n, inca = 1, incb = 1, i;
        MKL_Complex16 a[N], b[N], c;
        /* zdotc() is declared in mkl_blas.h */
        n = N;
        for( i = 0; i < n; i++ ){
            a[i].real = (double)i; a[i].imag = (double)i * 2.0;
            b[i].real = (double)(n - i); b[i].imag = (double)i * 2.0;
        }
        zdotc( &c, &n, a, &inca, b, &incb );
        printf( "The complex dot product is: ( %6.2f, %6.2f )\n", c.real, c.imag );
        return 0;
    }

NOTE Instead of calling BLAS directly from C programs, you might wish to use the CBLAS interface; this is the supported way of calling BLAS from C. For more information about CBLAS, see Appendix D, which presents CBLAS, the C interface to the Basic Linear Algebra Subprograms (BLAS) implemented in Intel® MKL.

Fourier Transform Functions Code Examples

This section presents code examples of functions described in the “FFT Functions” and “Cluster FFT Functions” sections in the “Fourier Transform Functions” chapter. The examples are grouped in subsections

• Examples for FFT Functions, including Examples of Using Multi-Threading for FFT Computation
• Examples for Cluster FFT Functions
• Auxiliary data transformations.

FFT Code Examples

This section presents code examples of using the FFT interface functions described in the “Fourier Transform Functions” chapter. Here are the examples of two one-dimensional computations. These examples use the default settings for all of the configuration parameters, which are specified in “Configuration Settings”.

One-dimensional In-place FFT (Fortran Interface)

    ! Fortran example.
    ! 1D complex to complex, and real to conjugate-even
    Use MKL_DFTI
    Complex :: X(32)
    Real :: Y(34)
    type(DFTI_DESCRIPTOR), POINTER :: My_Desc1_Handle, My_Desc2_Handle
    Integer :: Status
    !...put input data into X(1),...,X(32); Y(1),...,Y(32)
    ! Perform a complex to complex transform
    Status = DftiCreateDescriptor( My_Desc1_Handle, DFTI_SINGLE,&
       DFTI_COMPLEX, 1, 32 )
    Status = DftiCommitDescriptor( My_Desc1_Handle )
    Status = DftiComputeForward( My_Desc1_Handle, X )
    Status = DftiFreeDescriptor(My_Desc1_Handle)
    ! result is given by {X(1),X(2),...,X(32)}
    ! Perform a real to complex conjugate-even transform
    Status = DftiCreateDescriptor(My_Desc2_Handle, DFTI_SINGLE,&
       DFTI_REAL, 1, 32)
    Status = DftiCommitDescriptor(My_Desc2_Handle)
    Status = DftiComputeForward(My_Desc2_Handle, Y)
    Status = DftiFreeDescriptor(My_Desc2_Handle)
    ! result is given in CCS format.

One-dimensional Out-of-place FFT (Fortran Interface)

    ! Fortran example.
    ! 1D complex to complex, and real to conjugate-even
    Use MKL_DFTI
    Complex :: X_in(32)
    Complex :: X_out(32)
    Real :: Y_in(32)
    Real :: Y_out(34)
    type(DFTI_DESCRIPTOR), POINTER :: My_Desc1_Handle, My_Desc2_Handle
    Integer :: Status
    !...put input data into X_in(1),...,X_in(32); Y_in(1),...,Y_in(32)
    ! Perform a complex to complex transform
    Status = DftiCreateDescriptor( My_Desc1_Handle, DFTI_SINGLE, DFTI_COMPLEX, 1, 32 )
    Status = DftiSetValue( My_Desc1_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
    Status = DftiCommitDescriptor( My_Desc1_Handle )
    Status = DftiComputeForward( My_Desc1_Handle, X_in, X_out )
    Status = DftiFreeDescriptor(My_Desc1_Handle)
    ! result is given by {X_out(1),X_out(2),...,X_out(32)}
    ! Perform a real to complex conjugate-even transform
    Status = DftiCreateDescriptor(My_Desc2_Handle, DFTI_SINGLE, DFTI_REAL, 1, 32)
    Status = DftiSetValue( My_Desc2_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
    Status = DftiCommitDescriptor(My_Desc2_Handle)
    Status = DftiComputeForward(My_Desc2_Handle, Y_in, Y_out)
    Status = DftiFreeDescriptor(My_Desc2_Handle)
    ! result is given by Y_out in CCS format.
One-dimensional In-place FFT (C Interface) /* C example, float _Complex is defined in C9X */ #include "mkl_dfti.h" float _Complex x[32]; float y[34]; DFTI_DESCRIPTOR_HANDLE my_desc1_handle; DFTI_DESCRIPTOR_HANDLE my_desc2_handle; MKL_LONG status; //...put input data into x[0],...,x[31]; y[0],...,y[31] status = DftiCreateDescriptor( &my_desc1_handle, DFTI_SINGLE, DFTI_COMPLEX, 1, 32); status = DftiCommitDescriptor( my_desc1_handle ); status = DftiComputeForward( my_desc1_handle, x); status = DftiFreeDescriptor(&my_desc1_handle); /* result is x[0], ..., x[31]*/ status = DftiCreateDescriptor( &my_desc2_handle, DFTI_SINGLE, DFTI_REAL, 1, 32); status = DftiCommitDescriptor( my_desc2_handle); status = DftiComputeForward( my_desc2_handle, y); status = DftiFreeDescriptor(&my_desc2_handle); /* result is given in CCS format*/ One-dimensional Out-of-place FFT (C Interface) /* C example, float _Complex is defined in C9X */ #include "mkl_dfti.h" float _Complex x_in[32]; float _Complex x_out[32]; float y_in[32]; float y_out[34]; DFTI_DESCRIPTOR_HANDLE my_desc1_handle; DFTI_DESCRIPTOR_HANDLE my_desc2_handle; MKL_LONG status; //...put input data into x_in[0],...,x_in[31]; y_in[0],...,y_in[31] status = DftiCreateDescriptor( &my_desc1_handle, DFTI_SINGLE, DFTI_COMPLEX, 1, 32); status = DftiSetValue( my_desc1_handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE); status = DftiCommitDescriptor( my_desc1_handle ); status = DftiComputeForward( my_desc1_handle, x_in, x_out); status = DftiFreeDescriptor(&my_desc1_handle); /* result is x_out[0], ..., x_out[31]*/ status = DftiCreateDescriptor( &my_desc2_handle, DFTI_SINGLE, DFTI_REAL, 1, 32); Status = DftiSetValue( My_Desc2_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE); status = DftiCommitDescriptor( my_desc2_handle); Code Examples C 2657 status = DftiComputeForward( my_desc2_handle, y_in, y_out); status = DftiFreeDescriptor(&my_desc2_handle); /* result is given by y_out in CCS format*/ Two-dimensional FFT (Fortran Interface) The following is an example of 
two simple two-dimensional transforms. Notice that the data and result parameters in computation functions are all declared as assumed-size rank-1 arrays DIMENSION(0:*). Therefore, a two-dimensional array must be mapped to a one-dimensional array by an EQUIVALENCE statement or other facilities of Fortran.

! Fortran example.
! 2D complex to complex, and real to conjugate-even
Use MKL_DFTI
Complex :: X_2D(32,100)
Real :: Y_2D(34, 102)
Complex :: X(3200)
Real :: Y(3468)
Equivalence (X_2D, X)
Equivalence (Y_2D, Y)
type(DFTI_DESCRIPTOR), POINTER :: My_Desc1_Handle, My_Desc2_Handle
Integer :: Status, L(2)
!...put input data into X_2D(j,k), Y_2D(j,k), 1<=j<=32,1<=k<=100
!...set L(1) = 32, L(2) = 100
!...the transform is a 32-by-100
! Perform a complex to complex transform
Status = DftiCreateDescriptor( My_Desc1_Handle, DFTI_SINGLE,&
  DFTI_COMPLEX, 2, L)
Status = DftiCommitDescriptor( My_Desc1_Handle)
Status = DftiComputeForward( My_Desc1_Handle, X)
Status = DftiFreeDescriptor(My_Desc1_Handle)
! result is given by X_2D(j,k), 1<=j<=32, 1<=k<=100
! Perform a real to complex conjugate-even transform
Status = DftiCreateDescriptor( My_Desc2_Handle, DFTI_SINGLE,&
  DFTI_REAL, 2, L)
Status = DftiCommitDescriptor( My_Desc2_Handle)
Status = DftiComputeForward( My_Desc2_Handle, Y)
Status = DftiFreeDescriptor(My_Desc2_Handle)
! result is given by the complex value z(j,k) 1<=j<=32; 1<=k<=100
! and is stored in CCS format

Two-dimensional FFT (C Interface)

/* C99 example */
#include "mkl_dfti.h"
float _Complex x[32][100];
float y[34][102];
DFTI_DESCRIPTOR_HANDLE my_desc1_handle;
DFTI_DESCRIPTOR_HANDLE my_desc2_handle;
MKL_LONG status, l[2];
//...put input data into x[j][k] 0<=j<=31, 0<=k<=99
//...put input data into y[j][k] 0<=j<=31, 0<=k<=99
l[0] = 32; l[1] = 100;
status = DftiCreateDescriptor( &my_desc1_handle, DFTI_SINGLE, DFTI_COMPLEX, 2, l);
status = DftiCommitDescriptor( my_desc1_handle);
status = DftiComputeForward( my_desc1_handle, x);
status = DftiFreeDescriptor(&my_desc1_handle);
/* result is the complex value x[j][k], 0<=j<=31, 0<=k<=99 */
status = DftiCreateDescriptor( &my_desc2_handle, DFTI_SINGLE, DFTI_REAL, 2, l);
status = DftiCommitDescriptor( my_desc2_handle);
status = DftiComputeForward( my_desc2_handle, y);
status = DftiFreeDescriptor(&my_desc2_handle);
/* result is the complex value z(j,k) 0<=j<=31; 0<=k<=99
   and is stored in CCS format */

The following examples demonstrate how you can change the default configuration settings by using the DftiSetValue function. For instance, to preserve the input data after the FFT computation, the configuration of the DFTI_PLACEMENT should be changed to "not in place" from the default choice of "in place." The code below illustrates how this can be done:

Changing Default Settings (Fortran)

! Fortran example
! 1D complex to complex, not in place
Use MKL_DFTI
Complex :: X_in(32), X_out(32)
type(DFTI_DESCRIPTOR), POINTER :: My_Desc_Handle
Integer :: Status
!...put input data into X_in(j), 1<=j<=32
Status = DftiCreateDescriptor( My_Desc_Handle,&
  DFTI_SINGLE, DFTI_COMPLEX, 1, 32)
Status = DftiSetValue( My_Desc_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
Status = DftiCommitDescriptor( My_Desc_Handle)
Status = DftiComputeForward( My_Desc_Handle, X_in, X_out)
Status = DftiFreeDescriptor (My_Desc_Handle)
! result is X_out(1),X_out(2),...,X_out(32)

Changing Default Settings (C)

/* C99 example */
#include "mkl_dfti.h"
float _Complex x_in[32], x_out[32];
DFTI_DESCRIPTOR_HANDLE my_desc_handle;
MKL_LONG status;
//...put input data into x_in[j], 0 <= j < 32
status = DftiCreateDescriptor( &my_desc_handle, DFTI_SINGLE, DFTI_COMPLEX, 1, 32);
status = DftiSetValue( my_desc_handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
status = DftiCommitDescriptor( my_desc_handle);
status = DftiComputeForward( my_desc_handle, x_in, x_out);
status = DftiFreeDescriptor(&my_desc_handle);
/* result is x_out[0], x_out[1], ..., x_out[31] */

Using Status Checking Functions

The example illustrates the use of status checking functions described in Chapter 11.

/* C */
DFTI_DESCRIPTOR_HANDLE desc;
MKL_LONG status;
// . . . descriptor creation and other code
status = DftiCommitDescriptor(desc);
if (status && !DftiErrorClass(status,DFTI_NO_ERROR))
{
    printf ("Error: %s\n", DftiErrorMessage(status));
}

! Fortran
type(DFTI_DESCRIPTOR), POINTER :: desc
integer status
! ...descriptor creation and other code
status = DftiCommitDescriptor(desc)
if (status .ne. 0) then
  if (.not. DftiErrorClass(status,DFTI_NO_ERROR)) then
    print *, 'Error: ', DftiErrorMessage(status)
  endif
endif

Computing 2D FFT by One-Dimensional Transforms

Below is an example where a 20-by-40 two-dimensional FFT is computed explicitly using one-dimensional transforms. Notice that the data and result parameters in computation functions are all declared as assumed-size rank-1 arrays DIMENSION(0:*). Therefore, a two-dimensional array must be mapped to a one-dimensional array by an EQUIVALENCE statement or other facilities of Fortran.

! Fortran
use mkl_dfti
Complex :: X_2D(20,40)
Complex :: X(800)
Equivalence (X_2D, X)
INTEGER :: STRIDE(2)
type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle_Dim1
type(DFTI_DESCRIPTOR), POINTER :: Desc_Handle_Dim2
! ...
Status = DftiCreateDescriptor(Desc_Handle_Dim1, DFTI_SINGLE,&
  DFTI_COMPLEX, 1, 20 )
Status = DftiCreateDescriptor(Desc_Handle_Dim2, DFTI_SINGLE,&
  DFTI_COMPLEX, 1, 40 )
! perform 40 one-dimensional transforms along 1st dimension
Status = DftiSetValue( Desc_Handle_Dim1, DFTI_NUMBER_OF_TRANSFORMS, 40 )
Status = DftiSetValue( Desc_Handle_Dim1, DFTI_INPUT_DISTANCE, 20 )
Status = DftiSetValue( Desc_Handle_Dim1, DFTI_OUTPUT_DISTANCE, 20 )
Status = DftiCommitDescriptor( Desc_Handle_Dim1 )
Status = DftiComputeForward( Desc_Handle_Dim1, X )
! perform 20 one-dimensional transforms along 2nd dimension
Stride(1) = 0; Stride(2) = 20
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_NUMBER_OF_TRANSFORMS, 20 )
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_INPUT_DISTANCE, 1 )
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_OUTPUT_DISTANCE, 1 )
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_INPUT_STRIDES, Stride )
Status = DftiSetValue( Desc_Handle_Dim2, DFTI_OUTPUT_STRIDES, Stride )
Status = DftiCommitDescriptor( Desc_Handle_Dim2 )
Status = DftiComputeForward( Desc_Handle_Dim2, X )
Status = DftiFreeDescriptor( Desc_Handle_Dim1 )
Status = DftiFreeDescriptor( Desc_Handle_Dim2 )

/* C */
#include "mkl_dfti.h"
float _Complex x[20][40];
MKL_LONG stride[2];
MKL_LONG status;
DFTI_DESCRIPTOR_HANDLE desc_handle_dim1;
DFTI_DESCRIPTOR_HANDLE desc_handle_dim2;
//...
status = DftiCreateDescriptor( &desc_handle_dim1, DFTI_SINGLE, DFTI_COMPLEX, 1, 20 );
status = DftiCreateDescriptor( &desc_handle_dim2, DFTI_SINGLE, DFTI_COMPLEX, 1, 40 );
/* perform 40 one-dimensional transforms along 1st dimension */
/* note that the 1st dimension data are not unit-stride */
stride[0] = 0; stride[1] = 40;
status = DftiSetValue( desc_handle_dim1, DFTI_NUMBER_OF_TRANSFORMS, 40 );
status = DftiSetValue( desc_handle_dim1, DFTI_INPUT_DISTANCE, 1 );
status = DftiSetValue( desc_handle_dim1, DFTI_OUTPUT_DISTANCE, 1 );
status = DftiSetValue( desc_handle_dim1, DFTI_INPUT_STRIDES, stride );
status = DftiSetValue( desc_handle_dim1, DFTI_OUTPUT_STRIDES, stride );
status = DftiCommitDescriptor( desc_handle_dim1 );
status = DftiComputeForward( desc_handle_dim1, x );
/* perform 20 one-dimensional transforms along 2nd dimension */
/* note that the 2nd dimension is unit stride */
status = DftiSetValue( desc_handle_dim2, DFTI_NUMBER_OF_TRANSFORMS, 20 );
status = DftiSetValue( desc_handle_dim2, DFTI_INPUT_DISTANCE, 40 );
status = DftiSetValue( desc_handle_dim2, DFTI_OUTPUT_DISTANCE, 40 );
status = DftiCommitDescriptor( desc_handle_dim2 );
status = DftiComputeForward( desc_handle_dim2, x );
status = DftiFreeDescriptor( &desc_handle_dim1 );
status = DftiFreeDescriptor( &desc_handle_dim2 );

The following are examples of real multi-dimensional transforms with CCE format storage of the conjugate-even complex matrix. Example “Two-Dimensional REAL In-place FFT (Fortran Interface)” is a two-dimensional in-place transform and Example “Two-Dimensional REAL Out-of-place FFT (Fortran Interface)” is a two-dimensional out-of-place transform, both in the Fortran interface. Example “Three-Dimensional REAL FFT (C Interface)” is a three-dimensional out-of-place transform in the C interface. Note that the data and result parameters in computation functions are all declared as assumed-size rank-1 arrays DIMENSION(0:*).
Therefore, a two-dimensional array must be mapped to a one-dimensional array by an EQUIVALENCE statement or other facilities of Fortran.

Two-Dimensional REAL In-place FFT (Fortran Interface)

! Fortran example.
! 2D and real to conjugate-even
Use MKL_DFTI
Real :: X_2D(34,100) ! 34 = (32/2 + 1)*2
Real :: X(3400)
Equivalence (X_2D, X)
type(DFTI_DESCRIPTOR), POINTER :: My_Desc_Handle
Integer :: Status, L(2)
Integer :: strides_in(3)
Integer :: strides_out(3)
! ...put input data into X_2D(j,k), 1<=j<=32,1<=k<=100
! ...set L(1) = 32, L(2) = 100
! ...set strides_in(1) = 0, strides_in(2) = 1, strides_in(3) = 34
! ...set strides_out(1) = 0, strides_out(2) = 1, strides_out(3) = 17
! ...the transform is a 32-by-100
! Perform a real to complex conjugate-even transform
Status = DftiCreateDescriptor( My_Desc_Handle, DFTI_SINGLE,&
  DFTI_REAL, 2, L )
Status = DftiSetValue(My_Desc_Handle, DFTI_CONJUGATE_EVEN_STORAGE,&
  DFTI_COMPLEX_COMPLEX)
Status = DftiSetValue(My_Desc_Handle, DFTI_INPUT_STRIDES, strides_in)
Status = DftiSetValue(My_Desc_Handle, DFTI_OUTPUT_STRIDES, strides_out)
Status = DftiCommitDescriptor( My_Desc_Handle)
Status = DftiComputeForward( My_Desc_Handle, X )
Status = DftiFreeDescriptor(My_Desc_Handle)
! result is given by the complex value z(j,k) 1<=j<=17; 1<=k<=100 and
! is stored in real matrix X_2D in CCE format.

Two-Dimensional REAL Out-of-place FFT (Fortran Interface)

! Fortran example.
! 2D and real to conjugate-even
Use MKL_DFTI
Real :: X_2D(32,100)
Complex :: Y_2D(17, 100) ! 17 = 32/2 + 1
Real :: X(3200)
Complex :: Y(1700)
Equivalence (X_2D, X)
Equivalence (Y_2D, Y)
type(DFTI_DESCRIPTOR), POINTER :: My_Desc_Handle
Integer :: Status, L(2)
Integer :: strides_out(3)
! ...put input data into X_2D(j,k), 1<=j<=32,1<=k<=100
! ...set L(1) = 32, L(2) = 100
! ...set strides_out(1) = 0, strides_out(2) = 1, strides_out(3) = 17
! ...the transform is a 32-by-100
! Perform a real to complex conjugate-even transform
Status = DftiCreateDescriptor( My_Desc_Handle, DFTI_SINGLE,&
  DFTI_REAL, 2, L )
Status = DftiSetValue(My_Desc_Handle,&
  DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
Status = DftiSetValue( My_Desc_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE )
Status = DftiSetValue(My_Desc_Handle,&
  DFTI_OUTPUT_STRIDES, strides_out)
Status = DftiCommitDescriptor(My_Desc_Handle)
Status = DftiComputeForward(My_Desc_Handle, X, Y)
Status = DftiFreeDescriptor(My_Desc_Handle)
! result is given by the complex value z(j,k) 1<=j<=17; 1<=k<=100 and
! is stored in complex matrix Y_2D in CCE format.

Three-Dimensional REAL FFT (C Interface)

/* C99 example */
#include "mkl_dfti.h"
float x[32][100][19];
float _Complex y[32][100][10]; /* 10 = 19/2 + 1 */
DFTI_DESCRIPTOR_HANDLE my_desc_handle;
MKL_LONG status, l[3];
MKL_LONG strides_out[4];
//...put input data into x[j][k][s] 0<=j<=31, 0<=k<=99, 0<=s<=18
l[0] = 32; l[1] = 100; l[2] = 19;
strides_out[0] = 0; strides_out[1] = 1000;
strides_out[2] = 10; strides_out[3] = 1;
status = DftiCreateDescriptor( &my_desc_handle, DFTI_SINGLE, DFTI_REAL, 3, l );
status = DftiSetValue(my_desc_handle, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX);
status = DftiSetValue( my_desc_handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE );
status = DftiSetValue(my_desc_handle, DFTI_OUTPUT_STRIDES, strides_out);
status = DftiCommitDescriptor(my_desc_handle);
status = DftiComputeForward(my_desc_handle, x, y);
status = DftiFreeDescriptor(&my_desc_handle);
/* result is the complex value z(j,k,s) 0<=j<=31; 0<=k<=99, 0<=s<=9
   and is stored in complex matrix y in CCE format. */

Examples of Using Multi-Threading for FFT Computation

The following sample program shows how to employ internal threading in Intel MKL for FFT computation (see case "a" in “Number of user threads”).
To specify the number of threads inside Intel MKL, use the following settings:
set MKL_NUM_THREADS = 1 for one-threaded mode;
set MKL_NUM_THREADS = 4 for multi-threaded mode.
Note that the configuration parameter DFTI_NUMBER_OF_USER_THREADS must be equal to its default value 1.

Using Intel MKL Internal Threading Mode

#include "mkl_dfti.h"
int main ()
{
    float x[200][100];
    DFTI_DESCRIPTOR_HANDLE fft;
    MKL_LONG len[2] = {200, 100};
    // initialize x
    DftiCreateDescriptor ( &fft, DFTI_SINGLE, DFTI_REAL, 2, len );
    DftiCommitDescriptor ( fft );
    DftiComputeForward ( fft, x );
    DftiFreeDescriptor ( &fft );
    return 0;
}

The following Example “Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region” and Example “Using Parallel Mode with Multiple Descriptors Initialized in One Thread” illustrate a parallel customer program with each descriptor instance used only in a single thread (see cases "b" and "c" in “Number of user threads”).

Specify the number of threads for Example “Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region” like this:
set MKL_NUM_THREADS = 1 for Intel MKL to work in the single-threaded mode (recommended);
set OMP_NUM_THREADS = 4 for the customer program to work in the multi-threaded mode.
The configuration parameter DFTI_NUMBER_OF_USER_THREADS must have its default value of 1.

Using Parallel Mode with Multiple Descriptors Initialized in a Parallel Region

Note that in this example, the program can be transformed to become single-threaded at the customer level but using parallel mode within Intel MKL (case "a"). To achieve this, you need to set the parameter DFTI_NUMBER_OF_TRANSFORMS = 4 and to set the corresponding parameter DFTI_INPUT_DISTANCE = 5000.
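The value 5000 in the note above follows from the layout of the data array used in this example: the four 50-by-100 transforms sit back to back in one array, so consecutive transforms start 50*100 elements apart. The sketch below only checks that arithmetic. It makes no DFTI calls, and `complex8_sketch`, `x_sketch`, and `transform_distance` are hypothetical names introduced here so that the fragment builds without the MKL headers.

```c
#include <stddef.h>

/* Hypothetical stand-in for MKL_Complex8 (a pair of floats); an assumption
   made only so this sketch compiles without mkl_dfti.h. */
typedef struct { float real; float imag; } complex8_sketch;

enum { NUM_TRANSFORMS = 4, LEN0 = 50, LEN1 = 100 };

/* Same shape as the array x[4][50][100] used in the example. */
complex8_sketch x_sketch[NUM_TRANSFORMS][LEN0][LEN1];

/* Number of complex elements occupied by one 50-by-100 transform, that is,
   the offset from the start of one transform to the start of the next. */
size_t transform_distance(void)
{
    return sizeof(x_sketch[0]) / sizeof(x_sketch[0][0][0]);
}
```

With DFTI_NUMBER_OF_TRANSFORMS set to 4, the value returned by `transform_distance()` is the one the note gives for DFTI_INPUT_DISTANCE.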
C code for the example is as follows:

#include "mkl_dfti.h"
#include <omp.h>
#define ARRAY_LEN(a) sizeof(a)/sizeof(a[0])
int main ()
{
    // 4 OMP threads, each does 2D FFT 50x100 points
    MKL_Complex8 x[4][50][100];
    int nth = ARRAY_LEN(x);
    MKL_LONG len[2] = {ARRAY_LEN(x[0]), ARRAY_LEN(x[0][0])};
    int th;
    // assume x is initialized and do 2D FFTs
    #pragma omp parallel for shared(len, x)
    for (th = 0; th < nth; th++)
    {
        DFTI_DESCRIPTOR_HANDLE myFFT;
        DftiCreateDescriptor (&myFFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len);
        DftiCommitDescriptor (myFFT);
        DftiComputeForward (myFFT, x[th]);
        DftiFreeDescriptor (&myFFT);
    }
    return 0;
}

Fortran code for the example is as follows:

program fft2d_private_descr_main
  use mkl_dfti
  integer nth, len(2)
  ! 4 OMP threads, each does 2D FFT 50x100 points
  parameter (nth = 4, len = (/50, 100/))
  complex x(len(2)*len(1), nth)
  type(dfti_descriptor), pointer :: myFFT
  integer th, myStatus
  ! assume x is initialized and do 2D FFTs
  !$OMP PARALLEL DO SHARED(len, x) PRIVATE(myFFT, myStatus)
  do th = 1, nth
    myStatus = DftiCreateDescriptor (myFFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len)
    myStatus = DftiCommitDescriptor (myFFT)
    myStatus = DftiComputeForward (myFFT, x(:, th))
    myStatus = DftiFreeDescriptor (myFFT)
  end do
  !$OMP END PARALLEL DO
end

Specify the number of threads for Example “Using Parallel Mode with Multiple Descriptors Initialized in One Thread” like this:
set MKL_NUM_THREADS = 1 for Intel MKL to work in the single-threaded mode (obligatory);
set OMP_NUM_THREADS = 4 for the customer program to work in the multi-threaded mode.
The configuration parameter DFTI_NUMBER_OF_USER_THREADS must have the default value of 1.
Using Parallel Mode with Multiple Descriptors Initialized in One Thread

C code for the example is as follows:

#include "mkl_dfti.h"
#include <omp.h>
#define ARRAY_LEN(a) sizeof(a)/sizeof(a[0])
int main ()
{
    // 4 OMP threads, each does 2D FFT 50x100 points
    MKL_Complex8 x[4][50][100];
    int nth = ARRAY_LEN(x);
    MKL_LONG len[2] = {ARRAY_LEN(x[0]), ARRAY_LEN(x[0][0])};
    DFTI_DESCRIPTOR_HANDLE FFT[ARRAY_LEN(x)];
    int th;
    for (th = 0; th < nth; th++)
        DftiCreateDescriptor (&FFT[th], DFTI_SINGLE, DFTI_COMPLEX, 2, len);
    for (th = 0; th < nth; th++)
        DftiCommitDescriptor (FFT[th]);
    // assume x is initialized and do 2D FFTs
    #pragma omp parallel for shared(FFT, x)
    for (th = 0; th < nth; th++)
        DftiComputeForward (FFT[th], x[th]);
    for (th = 0; th < nth; th++)
        DftiFreeDescriptor (&FFT[th]);
    return 0;
}

Fortran code for the example is as follows:

program fft2d_array_descr_main
  use mkl_dfti
  integer nth, len(2)
  ! 4 OMP threads, each does 2D FFT 50x100 points
  parameter (nth = 4, len = (/50, 100/))
  complex x(len(2)*len(1), nth)
  type thread_data
    type(dfti_descriptor), pointer :: FFT
  end type thread_data
  type(thread_data) :: workload(nth)
  integer th, status, myStatus
  do th = 1, nth
    status = DftiCreateDescriptor (workload(th)%FFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len)
    status = DftiCommitDescriptor (workload(th)%FFT)
  end do
  ! assume x is initialized and do 2D FFTs
  !$OMP PARALLEL DO SHARED(len, x, workload) PRIVATE(myStatus)
  do th = 1, nth
    myStatus = DftiComputeForward (workload(th)%FFT, x(:, th))
  end do
  !$OMP END PARALLEL DO
  do th = 1, nth
    status = DftiFreeDescriptor (workload(th)%FFT)
  end do
end

The following Example “Using Parallel Mode with a Common Descriptor” illustrates a parallel customer program with a common descriptor used in several threads (see case "d" in “Number of user threads”).
In this case, the number of threads, as well as any other configuration parameter, must not be changed after the FFT has been initialized by the DftiCommitDescriptor() function.

Using Parallel Mode with a Common Descriptor

C code for the example is as follows:

#include "mkl_dfti.h"
#include <omp.h>
#define ARRAY_LEN(a) sizeof(a)/sizeof(a[0])
int main ()
{
    // 4 OMP threads, each does 2D FFT 50x100 points
    MKL_Complex8 x[4][50][100];
    int nth = ARRAY_LEN(x);
    MKL_LONG len[2] = {ARRAY_LEN(x[0]), ARRAY_LEN(x[0][0])};
    DFTI_DESCRIPTOR_HANDLE FFT;
    int th;
    DftiCreateDescriptor (&FFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len);
    DftiSetValue (FFT, DFTI_NUMBER_OF_USER_THREADS, nth);
    DftiCommitDescriptor (FFT);
    // assume x is initialized and do 2D FFTs
    #pragma omp parallel for shared(FFT, x)
    for (th = 0; th < nth; th++)
        DftiComputeForward (FFT, x[th]);
    DftiFreeDescriptor (&FFT);
    return 0;
}

Fortran code for the example is as follows:

program fft2d_shared_descr_main
  use mkl_dfti
  integer nth, len(2)
  ! 4 OMP threads, each does 2D FFT 50x100 points
  parameter (nth = 4, len = (/50, 100/))
  complex x(len(2)*len(1), nth)
  type(dfti_descriptor), pointer :: FFT
  integer th, status, myStatus
  status = DftiCreateDescriptor (FFT, DFTI_SINGLE, DFTI_COMPLEX, 2, len)
  status = DftiSetValue (FFT, DFTI_NUMBER_OF_USER_THREADS, nth)
  status = DftiCommitDescriptor (FFT)
  ! assume x is initialized and do 2D FFTs
  !$OMP PARALLEL DO SHARED(len, x, FFT) PRIVATE(myStatus)
  do th = 1, nth
    myStatus = DftiComputeForward (FFT, x(:, th))
  end do
  !$OMP END PARALLEL DO
  status = DftiFreeDescriptor (FFT)
end

Examples for Cluster FFT Functions

The following C example computes a 2-dimensional out-of-place FFT using the cluster FFT interface:

2D Out-of-place Cluster FFT Computation

DFTI_DESCRIPTOR_DM_HANDLE desc;
MKL_LONG len[2],v,i,j,n,s;
Complex *in,*out;
MPI_Init(...);
// Create descriptor for 2D FFT
len[0]=nx;
len[1]=ny;
DftiCreateDescriptorDM(MPI_COMM_WORLD,&desc,DFTI_DOUBLE,DFTI_COMPLEX,2,len);
// Ask necessary length of in and out arrays and allocate memory
DftiGetValueDM(desc,CDFT_LOCAL_SIZE,&v);
in=(Complex*)malloc(v*sizeof(Complex));
out=(Complex*)malloc(v*sizeof(Complex));
// Fill local array with initial data. Current process performs n rows,
// 0 row of in corresponds to s row of virtual global array
DftiGetValueDM(desc,CDFT_LOCAL_NX,&n);
DftiGetValueDM(desc,CDFT_LOCAL_X_START,&s);
// Virtual global array globalIN is defined by function f as
// globalIN[i*ny+j]=f(i,j)
for(i=0;i...

// Cartesian->polar conversion of complex data
// Cartesian representation: z = re + I*im
// Polar representation: z = r * exp( I*phi )
#include <mkl_vml.h>
void variant1_Cartesian2Polar(int n,const double *re,const double *im,
    double *r,double *phi)
{
    vdHypot(n,re,im,r); // compute radii r[]
    vdAtan2(n,im,re,phi); // compute phases phi[]
}
void variant2_Cartesian2Polar(int n,const MKL_Complex16 *z,double *r,double *phi,
    double *temp_re,double *temp_im)
{
    vzAbs(n,z,r); // compute radii r[]
    vdPackI(n, (double*)z + 0, 2, temp_re);
    vdPackI(n, (double*)z + 1, 2, temp_im);
    vdAtan2(n,temp_im,temp_re,phi); // compute phases phi[]
}

Conversion from polar to Cartesian representation of complex data

// Polar->Cartesian conversion of complex data.
// Polar representation: z = r * exp( I*phi )
// Cartesian representation: z = re + I*im
#include <mkl_vml.h>
void variant1_Polar2Cartesian(int n,const double *r,const double *phi,
    double *re,double *im)
{
    vdSinCos(n,phi,im,re); // compute direction, i.e. z[]/abs(z[])
    vdMul(n,r,re,re); // scale real part
    vdMul(n,r,im,im); // scale imaginary part
}
void variant2_Polar2Cartesian(int n,const double *r,const double *phi,
    MKL_Complex16 *z,
    double *temp_re,double *temp_im)
{
    vdSinCos(n,phi,temp_im,temp_re); // compute direction, i.e. z[]/abs(z[])
    vdMul(n,r,temp_im,temp_im); // scale imaginary part
    vdMul(n,r,temp_re,temp_re); // scale real part
    vdUnpackI(n,temp_re,(double*)z + 0, 2); // fill in result.re
    vdUnpackI(n,temp_im,(double*)z + 1, 2); // fill in result.im
}

Appendix D. CBLAS Interface to the BLAS

This appendix presents CBLAS, the C interface to the Basic Linear Algebra Subprograms (BLAS) implemented in Intel® MKL.
Similar to BLAS, the CBLAS interface includes the following levels of functions:
• “Level 1 CBLAS” (vector-vector operations)
• “Level 2 CBLAS” (matrix-vector operations)
• “Level 3 CBLAS” (matrix-matrix operations)
• “Sparse CBLAS” (operations on sparse vectors)

To obtain the C interface, the Fortran routine names are prefixed with cblas_ (for example, dasum becomes cblas_dasum). Names of all CBLAS functions are in lowercase letters.
Complex functions ?dotc and ?dotu become CBLAS subroutines (void functions); they return the complex result via a void pointer, added as the last parameter. CBLAS names of these functions are suffixed with _sub. For example, the BLAS function cdotc corresponds to cblas_cdotc_sub.

WARNING Users of the CBLAS interface should be aware that CBLAS is just a C interface to the BLAS, which is based on the FORTRAN standard and subject to the FORTRAN standard restrictions. In particular, the output parameters should not be referenced through more than one argument.
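The _sub convention described above can be illustrated with a plain-C sketch. The routine below is a hypothetical reference, `ref_cdotc_sub`, written for this illustration; it is not the Intel MKL implementation. It takes the same parameter list as cblas_cdotc_sub, treats each single-precision complex number as two consecutive floats (real part, then imaginary part), and assumes that both increments are positive.

```c
/* Hypothetical reference routine mirroring the parameter list of
   cblas_cdotc_sub. Computes sum(conj(X[i]) * Y[i]) and stores the complex
   result through the last (void pointer) parameter.
   Assumptions: incX > 0, incY > 0, complex values stored as (re, im) floats. */
void ref_cdotc_sub(const int N, const void *X, const int incX,
                   const void *Y, const int incY, void *dotc)
{
    const float *x = (const float *)X;
    const float *y = (const float *)Y;
    float re = 0.0f, im = 0.0f;
    int i;

    for (i = 0; i < N; i++) {
        const float xr = x[2 * i * incX], xi = x[2 * i * incX + 1];
        const float yr = y[2 * i * incY], yi = y[2 * i * incY + 1];
        /* conj(x) * y = (xr - I*xi) * (yr + I*yi) */
        re += xr * yr + xi * yi;
        im += xr * yi - xi * yr;
    }
    ((float *)dotc)[0] = re;  /* real part of the result */
    ((float *)dotc)[1] = im;  /* imaginary part of the result */
}
```

The caller supplies storage for one complex value, for example `float result[2];`, and passes its address as the last argument. In line with the warning above, that storage should not overlap X or Y.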
In the descriptions of CBLAS interfaces, links provided for each function group lead to the descriptions of the respective Fortran-interface BLAS functions.

CBLAS Arguments

The arguments of CBLAS functions comply with the following rules:
• Input arguments are declared with the const modifier.
• Non-complex scalar input arguments are passed by value.
• Complex scalar input arguments are passed as void pointers.
• Array arguments are passed by address.
• BLAS character arguments are replaced by the appropriate enumerated type.
• Level 2 and Level 3 routines acquire an additional parameter of type CBLAS_ORDER as their first argument. This parameter specifies whether two-dimensional arrays are row-major (CblasRowMajor) or column-major (CblasColMajor).

Enumerated Types

The CBLAS interface uses the following enumerated types:

enum CBLAS_ORDER {
    CblasRowMajor=101, /* row-major arrays */
    CblasColMajor=102}; /* column-major arrays */
enum CBLAS_TRANSPOSE {
    CblasNoTrans=111, /* trans='N' */
    CblasTrans=112, /* trans='T' */
    CblasConjTrans=113}; /* trans='C' */
enum CBLAS_UPLO {
    CblasUpper=121, /* uplo ='U' */
    CblasLower=122}; /* uplo ='L' */
enum CBLAS_DIAG {
    CblasNonUnit=131, /* diag ='N' */
    CblasUnit=132}; /* diag ='U' */
enum CBLAS_SIDE {
    CblasLeft=141, /* side ='L' */
    CblasRight=142}; /* side ='R' */

Level 1 CBLAS

This is an interface to “BLAS Level 1 Routines and Functions”, which perform basic vector-vector operations.
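As a reading aid for the prototypes in this appendix, the sketch below applies the argument rules listed under CBLAS Arguments to one operation. `ref_daxpy` is a hypothetical plain-C reference written for this illustration, not the Intel MKL routine. It has the same parameter list as cblas_daxpy and assumes that both increments are positive.

```c
/* Hypothetical reference routine with the same parameter list as
   cblas_daxpy; computes Y := alpha*X + Y over N elements.
   - N, alpha, incX, incY: non-complex scalar inputs, passed by value, const
   - X: input array, passed by address and declared const
   - Y: input/output array, passed by address
   Assumption: incX > 0 and incY > 0. */
void ref_daxpy(const int N, const double alpha,
               const double *X, const int incX,
               double *Y, const int incY)
{
    int i;
    for (i = 0; i < N; i++) {
        /* element i of each vector is located i*inc entries from its start */
        Y[i * incY] += alpha * X[i * incX];
    }
}
```

An increment of 2 for X, for instance, makes the routine read every second entry of the array, which is how a vector embedded in a larger array can be passed without copying.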
?asum
float cblas_sasum(const int N, const float *X, const int incX);
double cblas_dasum(const int N, const double *X, const int incX);
float cblas_scasum(const int N, const void *X, const int incX);
double cblas_dzasum(const int N, const void *X, const int incX);

?axpy
void cblas_saxpy(const int N, const float alpha, const float *X, const int incX, float *Y, const int incY);
void cblas_daxpy(const int N, const double alpha, const double *X, const int incX, double *Y, const int incY);
void cblas_caxpy(const int N, const void *alpha, const void *X, const int incX, void *Y, const int incY);
void cblas_zaxpy(const int N, const void *alpha, const void *X, const int incX, void *Y, const int incY);

?copy
void cblas_scopy(const int N, const float *X, const int incX, float *Y, const int incY);
void cblas_dcopy(const int N, const double *X, const int incX, double *Y, const int incY);
void cblas_ccopy(const int N, const void *X, const int incX, void *Y, const int incY);
void cblas_zcopy(const int N, const void *X, const int incX, void *Y, const int incY);

?dot
float cblas_sdot(const int N, const float *X, const int incX, const float *Y, const int incY);
double cblas_ddot(const int N, const double *X, const int incX, const double *Y, const int incY);

?sdot
float cblas_sdsdot(const int N, const float SB, const float *SX, const int incX, const float *SY, const int incY);
double cblas_dsdot(const int N, const float *SX, const int incX, const float *SY, const int incY);

?dotc
void cblas_cdotc_sub(const int N, const void *X, const int incX, const void *Y, const int incY, void *dotc);
void cblas_zdotc_sub(const int N, const void *X, const int incX, const void *Y, const int incY, void *dotc);

?dotu
void cblas_cdotu_sub(const int N, const void *X, const int incX, const void *Y, const int incY, void *dotu);
void cblas_zdotu_sub(const int N, const void *X, const int incX, const void *Y, const int incY, void *dotu);

?nrm2
float
cblas_snrm2(const int N, const float *X, const int incX);
double cblas_dnrm2(const int N, const double *X, const int incX);
float cblas_scnrm2(const int N, const void *X, const int incX);
double cblas_dznrm2(const int N, const void *X, const int incX);

?rot
void cblas_srot(const int N, float *X, const int incX, float *Y, const int incY, const float c, const float s);
void cblas_drot(const int N, double *X, const int incX, double *Y, const int incY, const double c, const double s);

?rotg
void cblas_srotg(float *a, float *b, float *c, float *s);
void cblas_drotg(double *a, double *b, double *c, double *s);

?rotm
void cblas_srotm(const int N, float *X, const int incX, float *Y, const int incY, const float *P);
void cblas_drotm(const int N, double *X, const int incX, double *Y, const int incY, const double *P);

?rotmg
void cblas_srotmg(float *d1, float *d2, float *b1, const float b2, float *P);
void cblas_drotmg(double *d1, double *d2, double *b1, const double b2, double *P);

?scal
void cblas_sscal(const int N, const float alpha, float *X, const int incX);
void cblas_dscal(const int N, const double alpha, double *X, const int incX);
void cblas_cscal(const int N, const void *alpha, void *X, const int incX);
void cblas_zscal(const int N, const void *alpha, void *X, const int incX);
void cblas_csscal(const int N, const float alpha, void *X, const int incX);
void cblas_zdscal(const int N, const double alpha, void *X, const int incX);

?swap
void cblas_sswap(const int N, float *X, const int incX, float *Y, const int incY);
void cblas_dswap(const int N, double *X, const int incX, double *Y, const int incY);
void cblas_cswap(const int N, void *X, const int incX, void *Y, const int incY);
void cblas_zswap(const int N, void *X, const int incX, void *Y, const int incY);

i?amax
CBLAS_INDEX cblas_isamax(const int N, const float *X, const int incX);
CBLAS_INDEX cblas_idamax(const int N, const double *X, const int incX);
CBLAS_INDEX cblas_icamax(const int N, const void *X, const int
incX);
CBLAS_INDEX cblas_izamax(const int N, const void *X, const int incX);

i?amin
CBLAS_INDEX cblas_isamin(const int N, const float *X, const int incX);
CBLAS_INDEX cblas_idamin(const int N, const double *X, const int incX);
CBLAS_INDEX cblas_icamin(const int N, const void *X, const int incX);
CBLAS_INDEX cblas_izamin(const int N, const void *X, const int incX);

?cabs1
double cblas_dcabs1(const void *z);
float cblas_scabs1(const void *c);

Level 2 CBLAS

This is an interface to “BLAS Level 2 Routines”, which perform basic matrix-vector operations. Each C routine in this group has an additional parameter of type CBLAS_ORDER (the first argument) that determines whether the two-dimensional arrays use column-major or row-major storage.

?gbmv
void cblas_sgbmv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const int KL, const int KU, const float alpha, const float *A, const int lda, const float *X, const int incX, const float beta, float *Y, const int incY);
void cblas_dgbmv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const int KL, const int KU, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY);
void cblas_cgbmv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const int KL, const int KU, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);
void cblas_zgbmv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const int KL, const int KU, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);

?gemv
void cblas_sgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const float alpha, const float *A, const int lda,
const float *X, const int incX, const float beta, float *Y, const int incY);
void cblas_dgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY);
void cblas_cgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);
void cblas_zgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY);

?ger
void cblas_sger(const enum CBLAS_ORDER order, const int M, const int N, const float alpha, const float *X, const int incX, const float *Y, const int incY, float *A, const int lda);
void cblas_dger(const enum CBLAS_ORDER order, const int M, const int N, const double alpha, const double *X, const int incX, const double *Y, const int incY, double *A, const int lda);

?gerc
void cblas_cgerc(const enum CBLAS_ORDER order, const int M, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);
void cblas_zgerc(const enum CBLAS_ORDER order, const int M, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);

?geru
void cblas_cgeru(const enum CBLAS_ORDER order, const int M, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);
void cblas_zgeru(const enum CBLAS_ORDER order, const int M, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda);

?hbmv
void cblas_chbmv(const enum CBLAS_ORDER order, const enum
CBLAS_UPLO Uplo, const int N, const int K, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY); void cblas_zhbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const int K, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY); ?hemv void cblas_chemv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY); void cblas_zhemv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *A, const int lda, const void *X, const int incX, const void *beta, void *Y, const int incY); ?her void cblas_cher(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const void *X, const int incX, void *A, const int lda); void cblas_zher(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const void *X, const int incX, void *A, const int lda); ?her2 void cblas_cher2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda); void cblas_zher2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *A, const int lda); ?hpmv void cblas_chpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *Ap, const void *X, const int incX, const void *beta, void *Y, const int incY); void cblas_zhpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *Ap, const void *X, const int incX, const void *beta, void *Y, const int incY); ?hpr void cblas_chpr(const enum CBLAS_ORDER order, 
const enum CBLAS_UPLO Uplo, const int N, const float alpha, const void *X, const int incX, void *A); void cblas_zhpr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const void *X, const int incX, void *A); ?hpr2 void cblas_chpr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *Ap); void cblas_zhpr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const void *alpha, const void *X, const int incX, const void *Y, const int incY, void *Ap); ?sbmv void cblas_ssbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const int K, const float alpha, const float *A, const int lda, const float *X, const int incX, const float beta, float *Y, const int incY); CBLAS Interface to the BLAS D 2673 void cblas_dsbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const int K, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY); ?spmv void cblas_sspmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *Ap, const float *X, const int incX, const float beta, float *Y, const int incY); void cblas_dspmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *Ap, const double *X, const int incX, const double beta, double *Y, const int incY); ?spr void cblas_sspr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *X, const int incX, float *Ap); void cblas_dspr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *X, const int incX, double *Ap); ?spr2 void cblas_sspr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *X, const int incX, const float *Y, 
const int incY, float *A);
void cblas_dspr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *X, const int incX, const double *Y, const int incY, double *A);

?symv
void cblas_ssymv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *A, const int lda, const float *X, const int incX, const float beta, float *Y, const int incY);
void cblas_dsymv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY);

?syr
void cblas_ssyr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *X, const int incX, float *A, const int lda);
void cblas_dsyr(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *X, const int incX, double *A, const int lda);

?syr2
void cblas_ssyr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const float alpha, const float *X, const int incX, const float *Y, const int incY, float *A, const int lda);
void cblas_dsyr2(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const int N, const double alpha, const double *X, const int incX, const double *Y, const int incY, double *A, const int lda);

?tbmv
void cblas_stbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const float *A, const int lda, float *X, const int incX);
void cblas_dtbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const double *A, const int lda, double *X, const int incX);
void cblas_ctbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const void *A, const int lda, void *X, const int incX);
void cblas_ztbmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const void *A, const int lda, void *X, const int incX);

?tbsv
void cblas_stbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const float *A, const int lda, float *X, const int incX);
void cblas_dtbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const double *A, const int lda, double *X, const int incX);
void cblas_ctbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const void *A, const int lda, void *X, const int incX);
void cblas_ztbsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const int K, const void *A, const int lda, void *X, const int incX);

?tpmv
void cblas_stpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const float *Ap, float *X, const int incX);
void cblas_dtpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const double *Ap, double *X, const int incX);
void cblas_ctpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *Ap, void *X, const int incX);
void cblas_ztpmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *Ap, void *X, const int incX);

?tpsv
void cblas_stpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const float *Ap, float *X, const int incX);
void cblas_dtpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const double *Ap, double *X, const int incX);
void cblas_ctpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *Ap, void *X, const int incX);
void cblas_ztpsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *Ap, void *X, const int incX);

?trmv
void cblas_strmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const float *A, const int lda, float *X, const int incX);
void cblas_dtrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const double *A, const int lda, double *X, const int incX);
void cblas_ctrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *A, const int lda, void *X, const int incX);
void cblas_ztrmv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *A, const int lda, void *X, const int incX);

?trsv
void cblas_strsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const float *A, const int lda, float *X, const int incX);
void cblas_dtrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const double *A, const int lda, double *X, const int incX);
void cblas_ctrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *A, const int lda, void *X, const int incX);
void cblas_ztrsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int N, const void *A, const int lda, void *X, const int incX);

Level 3 CBLAS

This is an interface to “BLAS Level 3 Routines”, which perform basic matrix-matrix operations. Each C routine in this group has an additional parameter of type CBLAS_ORDER (the first argument) that determines whether the two-dimensional arrays use column-major or row-major storage.

?gemm
void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const float alpha, const float *A, const int lda, const float *B, const int ldb, const float beta, float *C, const int ldc);
void cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc);
void cblas_cgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);
void cblas_zgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);

?hemm
void cblas_chemm(const enum CBLAS_ORDER Order, const enum
CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);
void cblas_zhemm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);

?herk
void cblas_cherk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const float alpha, const void *A, const int lda, const float beta, void *C, const int ldc);
void cblas_zherk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const double alpha, const void *A, const int lda, const double beta, void *C, const int ldc);

?her2k
void cblas_cher2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const float beta, void *C, const int ldc);
void cblas_zher2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const double beta, void *C, const int ldc);

?symm
void cblas_ssymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const float alpha, const float *A, const int lda, const float *B, const int ldb, const float beta, float *C, const int ldc);
void cblas_dsymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc);
void cblas_csymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);
void cblas_zsymm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const int M, const int N, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);

?syrk
void cblas_ssyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const float alpha, const float *A, const int lda, const float beta, float *C, const int ldc);
void cblas_dsyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const double alpha, const double *A, const int lda, const double beta, double *C, const int ldc);
void cblas_csyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *beta, void *C, const int ldc);
void cblas_zsyrk(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *beta, void *C, const int ldc);

?syr2k
void cblas_ssyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const float alpha, const float *A, const int lda, const float *B, const int ldb, const float beta, float *C, const int ldc);
void cblas_dsyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc);
void cblas_csyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);
void cblas_zsyr2k(const enum CBLAS_ORDER Order, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE Trans, const int N, const int K, const void *alpha, const void *A, const int lda, const void *B, const int ldb, const void *beta, void *C, const int ldc);

?trmm
void cblas_strmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const float alpha, const float *A, const int lda, float *B, const int ldb);
void cblas_dtrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const double alpha, const double *A, const int lda, double *B, const int ldb);
void cblas_ctrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const void *alpha, const void *A, const int lda, void *B, const int ldb);
void cblas_ztrmm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const void *alpha, const void *A, const int lda, void *B, const int ldb);

?trsm
void cblas_strsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const float alpha, const float *A, const int lda, float *B, const int ldb);
void cblas_dtrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG
Diag, const int M, const int N, const double alpha, const double *A, const int lda, double *B, const int ldb);
void cblas_ctrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const void *alpha, const void *A, const int lda, void *B, const int ldb);
void cblas_ztrsm(const enum CBLAS_ORDER Order, const enum CBLAS_SIDE Side, const enum CBLAS_UPLO Uplo, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag, const int M, const int N, const void *alpha, const void *A, const int lda, void *B, const int ldb);

Sparse CBLAS

This is an interface to Sparse BLAS Level 1 Routines, which perform a number of common vector operations on sparse vectors stored in compressed form. Note that all index parameters, indx, use C-style (zero-based) notation and vary in the range [0..N-1].

?axpyi
void cblas_saxpyi(const int N, const float alpha, const float *X, const int *indx, float *Y);
void cblas_daxpyi(const int N, const double alpha, const double *X, const int *indx, double *Y);
void cblas_caxpyi(const int N, const void *alpha, const void *X, const int *indx, void *Y);
void cblas_zaxpyi(const int N, const void *alpha, const void *X, const int *indx, void *Y);

?doti
float cblas_sdoti(const int N, const float *X, const int *indx, const float *Y);
double cblas_ddoti(const int N, const double *X, const int *indx, const double *Y);

?dotci
void cblas_cdotci_sub(const int N, const void *X, const int *indx, const void *Y, void *dotui);
void cblas_zdotci_sub(const int N, const void *X, const int *indx, const void *Y, void *dotui);

?dotui
void cblas_cdotui_sub(const int N, const void *X, const int *indx, const void *Y, void *dotui);
void cblas_zdotui_sub(const int N, const void *X, const int *indx, const void *Y, void *dotui);

?gthr
void cblas_sgthr(const int N, const float *Y, float *X, const int *indx);
void cblas_dgthr(const int N, const double *Y, double *X, const int *indx);
void cblas_cgthr(const int N, const void *Y, void *X, const int *indx);
void cblas_zgthr(const int N, const void *Y, void *X, const int *indx);

?gthrz
void cblas_sgthrz(const int N, float *Y, float *X, const int *indx);
void cblas_dgthrz(const int N, double *Y, double *X, const int *indx);
void cblas_cgthrz(const int N, void *Y, void *X, const int *indx);
void cblas_zgthrz(const int N, void *Y, void *X, const int *indx);

?roti
void cblas_sroti(const int N, float *X, const int *indx, float *Y, const float c, const float s);
void cblas_droti(const int N, double *X, const int *indx, double *Y, const double c, const double s);

?sctr
void cblas_ssctr(const int N, const float *X, const int *indx, float *Y);
void cblas_dsctr(const int N, const double *X, const int *indx, double *Y);
void cblas_csctr(const int N, const void *X, const int *indx, void *Y);
void cblas_zsctr(const int N, const void *X, const int *indx, void *Y);

Specific Features of Fortran 95 Interfaces for LAPACK Routines E

Intel® MKL implements a Fortran 95 interface for the LAPACK package, referred to below as MKL LAPACK95, that exposes the full capability of the MKL FORTRAN 77 LAPACK routines. This is the principal difference between Intel MKL and the Netlib Fortran 95 implementation of LAPACK. Compared with the Intel MKL LAPACK77 implementation, MKL LAPACK95 adds a package of source interfaces along with wrappers that make the implementation compiler-independent. As a result, the MKL LAPACK package can be used in all programming environments intended for Fortran 95.

Depending on the degree and type of difference from the Netlib implementation, the MKL LAPACK95 interfaces fall into several groups that require different transformations (see “MKL Fortran 95 Interfaces for LAPACK Routines vs. Netlib Implementation”).
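As a brief aside on the Sparse CBLAS prototypes listed above, the following is a minimal sketch of what zero-based index arrays mean for three of the operations. The functions ref_daxpyi, ref_dgthr, ref_dsctr and sparse_demo are hypothetical reference loops written for illustration only; they are not the Intel MKL implementations. They copy the argument order of cblas_daxpyi, cblas_dgthr and cblas_dsctr, and they assume that indx holds zero-based positions in the dense vector Y.

```c
#include <assert.h>

/* Hypothetical reference loops, NOT the MKL implementations.
 * Assumption: X is the compressed vector with N stored values and
 * indx[i] is the zero-based position of X[i] in the dense vector Y. */

/* Sparse axpy: Y[indx[i]] += alpha * X[i] */
void ref_daxpyi(const int N, const double alpha, const double *X,
                const int *indx, double *Y)
{
    for (int i = 0; i < N; ++i)
        Y[indx[i]] += alpha * X[i];
}

/* Gather: X[i] = Y[indx[i]] */
void ref_dgthr(const int N, const double *Y, double *X, const int *indx)
{
    for (int i = 0; i < N; ++i)
        X[i] = Y[indx[i]];
}

/* Scatter: Y[indx[i]] = X[i] */
void ref_dsctr(const int N, const double *X, const int *indx, double *Y)
{
    for (int i = 0; i < N; ++i)
        Y[indx[i]] = X[i];
}

/* Small self-check with three stored values at dense positions 0, 2, 4.
 * Returns 0 when every check passes. */
int sparse_demo(void)
{
    const int indx[3] = {0, 2, 4};
    const double x[3] = {1.0, 2.0, 3.0};
    double y[5] = {10.0, 20.0, 30.0, 40.0, 50.0};
    double g[3];

    ref_daxpyi(3, 2.0, x, indx, y);  /* y is now {12, 20, 34, 40, 56} */
    ref_dgthr(3, y, g, indx);        /* g is now {12, 34, 56} */
    assert(g[0] == 12.0 && g[1] == 34.0 && g[2] == 56.0);

    ref_dsctr(3, x, indx, y);        /* y is now {1, 20, 2, 40, 3} */
    assert(y[0] == 1.0 && y[1] == 20.0 && y[2] == 2.0);
    assert(y[3] == 40.0 && y[4] == 3.0);
    return 0;
}
```

A first index of 0 addresses the first element of Y, which is the practical consequence of the C-type index notation noted for this group. Code ported from the one-based Fortran Sparse BLAS interface would need its index arrays shifted by one under this assumption.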
The groups are given in full, with the calling sequences of the routines and the relevant differences from their Netlib analogs. The following conventions are used:

<interface> ::= <name of interface> ‘(’ <arguments list> ‘)’
<arguments list> ::= <first argument> {<argument>}*
<first argument> ::= < identifier >
<argument> ::= <required argument> | <optional argument>
<required argument> ::= ‘,’ <identifier>
<optional argument> ::= ‘[,’ <identifier> ‘]’
<name of interface> ::= <identifier>

where defined notions are separated from definitions by ::=, notion names are marked by angle brackets, terminals are given in quotes, and {…}* denotes repetition zero, one, or more times. <first argument> and each <required argument> must be present in all calls of the denoted interface; <optional argument> may be omitted.

Comments to interface definitions are provided where necessary. Comment lines begin with the character !. Two interfaces with one name are presented when two variants of a subroutine call (distinguished by argument types) exist.

Interfaces Identical to Netlib

GERFS(A,AF,IPIV,B,X[,TRANS][,FERR][,BERR][,INFO])
GETRI(A,IPIV[,INFO])
GEEQU(A,R,C[,ROWCND][,COLCND][,AMAX][,INFO])
GESV(A,B[,IPIV][,INFO])
GESVX(A,B,X[,AF][,IPIV][,FACT][,TRANS][,EQUED][,R][,C][,FERR][,BERR][,RCOND][,RPVGRW][,INFO])
GTSV(DL,D,DU,B[,INFO])
GTSVX(DL,D,DU,B,X[,DLF][,DF][,DUF][,DU2][,IPIV][,FACT][,TRANS][,FERR][,BERR][,RCOND][,INFO])
POSV(A,B[,UPLO][,INFO])
POSVX(A,B,X[,UPLO][,AF][,FACT][,EQUED][,S][,FERR][,BERR][,RCOND][,INFO])
PTSV(D,E,B[,INFO])
PTSVX(D,E,B,X[,DF][,EF][,FACT][,FERR][,BERR][,RCOND][,INFO])
SYSV(A,B[,UPLO][,IPIV][,INFO])
SYSVX(A,B,X[,UPLO][,AF][,IPIV][,FACT][,FERR][,BERR][,RCOND][,INFO])
HESVX(A,B,X[,UPLO][,AF][,IPIV][,FACT][,FERR][,BERR][,RCOND][,INFO])
HESV(A,B[,UPLO][,IPIV][,INFO])
SPSV(AP,B[,UPLO][,IPIV][,INFO])
HPSV(AP,B[,UPLO][,IPIV][,INFO])
SYTRD(A,TAU[,UPLO][,INFO])
ORGTR(A,TAU[,UPLO][,INFO])
HETRD(A,TAU[,UPLO][,INFO])
UNGTR(A,TAU[,UPLO][,INFO])
SYGST(A,B[,ITYPE][,UPLO][,INFO])
HEGST(A,B[,ITYPE][,UPLO][,INFO])
GELS(A,B[,TRANS][,INFO])
GELSY(A,B[,RANK][,JPVT][,RCOND][,INFO])
GELSS(A,B[,RANK][,S][,RCOND][,INFO])
GELSD(A,B[,RANK][,S][,RCOND][,INFO])
GGLSE(A,B,C,D,X[,INFO])
GGGLM(A,B,D,X,Y[,INFO])
SYEV(A,W[,JOBZ][,UPLO][,INFO])
HEEV(A,W[,JOBZ][,UPLO][,INFO])
SYEVD(A,W[,JOBZ][,UPLO][,INFO])
HEEVD(A,W[,JOBZ][,UPLO][,INFO])
STEV(D,E[,Z][,INFO])
STEVD(D,E[,Z][,INFO])
STEVX(D,E,W[,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
STEVR(D,E,W[,Z][,VL][,VU][,IL][,IU][,M][,ISUPPZ][,ABSTOL][,INFO])
GEES(A,WR,WI[,VS][,SELECT][,SDIM][,INFO])
GEES(A,W[,VS][,SELECT][,SDIM][,INFO])
GEESX(A,WR,WI[,VS][,SELECT][,SDIM][,RCONDE][,RCONDV][,INFO])
GEESX(A,W[,VS][,SELECT][,SDIM][,RCONDE][,RCONDV][,INFO])
GEEV(A,WR,WI[,VL][,VR][,INFO])
GEEV(A,W[,VL][,VR][,INFO])
GEEVX(A,WR,WI[,VL][,VR][,BALANC][,ILO][,IHI][,SCALE][,ABNRM][,RCONDE][,RCONDV][,INFO])
GEEVX(A,W[,VL][,VR][,BALANC][,ILO][,IHI][,SCALE][,ABNRM][,RCONDE][,RCONDV][,INFO])
GESVD(A,S[,U][,VT][,WW][,JOB][,INFO])
GGSVD(A,B,ALPHA,BETA[,K][,L][,U][,V][,Q][,IWORK][,INFO])
SYGV(A,B,W[,ITYPE][,JOBZ][,UPLO][,INFO])
HEGV(A,B,W[,ITYPE][,JOBZ][,UPLO][,INFO])
SYGVD(A,B,W[,ITYPE][,JOBZ][,UPLO][,INFO])
HEGVD(A,B,W[,ITYPE][,JOBZ][,UPLO][,INFO])
SPGVD(AP,BP,W[,ITYPE][,UPLO][,Z][,INFO])
HPGVD(AP,BP,W[,ITYPE][,UPLO][,Z][,INFO])
SPGVX(AP,BP,W[,ITYPE][,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
HPGVX(AP,BP,W[,ITYPE][,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
SBGVD(AB,BB,W[,UPLO][,Z][,INFO])
HBGVD(AB,BB,W[,UPLO][,Z][,INFO])
SBGVX(AB,BB,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,Q][,ABSTOL][,INFO])
HBGVX(AB,BB,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,Q][,ABSTOL][,INFO])
GGES(A,B,ALPHAR,ALPHAI,BETA[,VSL][,VSR][,SELECT][,SDIM][,INFO])
GGES(A,B,ALPHA,BETA[,VSL][,VSR][,SELECT][,SDIM][,INFO])
GGESX(A,B,ALPHAR,ALPHAI,BETA[,VSL][,VSR][,SELECT][,SDIM][,RCONDE][,RCONDV][,INFO])
GGEV(A,B,ALPHAR,ALPHAI,BETA[,VL][,VR][,INFO])
GGEV(A,B,ALPHA,BETA[,VL][,VR][,INFO])
GGEVX(A,B,ALPHAR,ALPHAI,BETA[,VL][,VR][,BALANC][,ILO][,IHI][,LSCALE][,RSCALE][,ABNRM][,BBNRM][,RCONDE][,RCONDV][,INFO])
GGEVX(A,B,ALPHA,BETA[,VL][,VR][,BALANC][,ILO][,IHI][,LSCALE][,RSCALE][,ABNRM][,BBNRM][,RCONDE][,RCONDV][,INFO])

Interfaces with Replaced Argument Names

Argument names in the routines of this group are replaced as follows:

Netlib Argument Name    MKL Argument Name
A                       AB
A                       AP
AF                      AFB
AF                      AFP
B                       BB
B                       BP
K                       KL

GBSV(AB,B[,KL][,IPIV][,INFO])
! netlib: (A,B,K,IPIV,INFO)
GBSVX(AB,B,X[,KL][,AFB][,IPIV][,FACT][,TRANS][,EQUED][,R][,C][,FERR][,BERR][,RCOND][,RPVGRW][,INFO])
! netlib: (A,B,X,KL,AF,IPIV,FACT,TRANS,EQUED,R,C,FERR,
! BERR,RCOND,RPVGRW,INFO)
PPSV(AP,B[,UPLO][,INFO])
! netlib: (A,B,UPLO,INFO)
PPSVX(AP,B,X[,UPLO][,AFP][,FACT][,EQUED][,S][,FERR][,BERR][,RCOND][,INFO])
! netlib: (A,B,X,UPLO,AF,FACT,EQUED,S,FERR,BERR,RCOND,INFO)
PBSV(AB,B[,UPLO][,INFO])
! netlib: (A,B,UPLO,INFO)
PBSVX(AB,B,X[,UPLO][,AFB][,FACT][,EQUED][,S][,FERR][,BERR][,RCOND][,INFO])
! netlib: (A,B,X,UPLO,AF,FACT,EQUED,S,FERR,BERR,RCOND,INFO)
SPSVX(AP,B,X[,UPLO][,AFP][,IPIV][,FACT][,FERR][,BERR][,RCOND][,INFO])
! netlib: (A,B,X,UPLO,AF,IPIV,FACT,FERR,BERR,RCOND,INFO)
HPSVX(AP,B,X[,UPLO][,AFP][,IPIV][,FACT][,FERR][,BERR][,RCOND][,INFO])
! netlib: (A,B,X,UPLO,AF,IPIV,FACT,FERR,BERR,RCOND,INFO)
SPEV(AP,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
HPEV(AP,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
SPEVD(AP,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
HPEVD(AP,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
SPEVX(AP,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! netlib: (A,B,W,UPLO,Z,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
HPEVX(AP,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! netlib: (A,B,W,UPLO,Z,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
SBEV(AB,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
HBEV(AB,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
SBEVD(AB,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
HBEVD(AB,W[,UPLO][,Z][,INFO])
! netlib: (A,W,UPLO,Z,INFO)
SBEVX(AB,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,Q][,ABSTOL][,INFO])
! netlib: (A,B,W,UPLO,Z,VL,VU,IL,IU,M,IFAIL,Q,ABSTOL,INFO)
HBEVX(AB,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,Q][,ABSTOL][,INFO])
! netlib: (A,B,W,UPLO,Z,VL,VU,IL,IU,M,IFAIL,Q,ABSTOL,INFO)
SPGV(AP,BP,W[,ITYPE][,UPLO][,Z][,INFO])
! netlib: (A,B,W,ITYPE,UPLO,Z,INFO)
HPGV(AB,BP,W[,ITYPE][,UPLO][,Z][,INFO])
! netlib: (A,B,W,ITYPE,UPLO,Z,INFO)
SBGV(AB,BB,W[,UPLO][,Z][,INFO])
! netlib: (A,B,W,UPLO,Z,INFO)
HBGV(AB,BB,W[,UPLO][,Z][,INFO])
! netlib: (A,B,W,UPLO,Z,INFO)

Modified Netlib Interfaces

SYEVX(A,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,W,JOBZ,UPLO,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 4, mkl: 3
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
HEEVX(A,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,W,JOBZ,UPLO,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 4, mkl: 3
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
SYEVR(A,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,ISUPPZ][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,W,JOBZ,UPLO,VL,VU,IL,IU,M,ISUPPZ,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 4, mkl: 3
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
HEEVR(A,W[,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,ISUPPZ][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,W,JOBZ,UPLO,VL,VU,IL,IU,M,ISUPPZ,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 4, mkl: 3
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
GESDD(A,S[,U][,VT][,JOBZ][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,S,U,VT,WW,JOB,INFO)
! Different number for parameter, netlib: 7, mkl: 6
! Absent mkl parameter: WW
! Absent mkl parameter: JOB
! Different order for parameter INFO, netlib: 7, mkl: 6
! Extra mkl parameter: JOBZ
SYGVX(A,B,W[,ITYPE][,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,B,W,ITYPE,JOBZ,UPLO,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 6, mkl: 5
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
HEGVX(A,B,W[,ITYPE][,UPLO][,Z][,VL][,VU][,IL][,IU][,M][,IFAIL][,ABSTOL][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,B,W,ITYPE,JOBZ,UPLO,VL,VU,IL,IU,M,IFAIL,ABSTOL,INFO)
! Different order for parameter UPLO, netlib: 6, mkl: 5
! Absent mkl parameter: JOBZ
! Extra mkl parameter: Z
GETRS(A,IPIV,B[,TRANS][,INFO])
! Interface netlib95 exists:
! Different intents for parameter A, netlib: INOUT, mkl: IN

Interfaces Absent From Netlib

GTTRF(DL,D,DU,DU2[,IPIV][,INFO])
PPTRF(A[,UPLO][,INFO])
PBTRF(A[,UPLO][,INFO])
PTTRF(D,E[,INFO])
SYTRF(A[,UPLO][,IPIV][,INFO])
HETRF(A[,UPLO][,IPIV][,INFO])
SPTRF(A[,UPLO][,IPIV][,INFO])
HPTRF(A[,UPLO][,IPIV][,INFO])
GBTRS(A,B,IPIV[,KL][,TRANS][,INFO])
GTTRS(DL,D,DU,DU2,B,IPIV[,TRANS][,INFO])
POTRS(A,B[,UPLO][,INFO])
PPTRS(A,B[,UPLO][,INFO])
PBTRS(A,B[,UPLO][,INFO])
PTTRS(D,E,B[,INFO])
PTTRS(D,E,B[,UPLO][,INFO])
SYTRS(A,B,IPIV[,UPLO][,INFO])
HETRS(A,B,IPIV[,UPLO][,INFO])
SPTRS(A,B,IPIV[,UPLO][,INFO])
HPTRS(A,B,IPIV[,UPLO][,INFO])
TRTRS(A,B[,UPLO][,TRANS][,DIAG][,INFO])
TPTRS(A,B[,UPLO][,TRANS][,DIAG][,INFO])
TBTRS(A,B[,UPLO][,TRANS][,DIAG][,INFO])
GECON(A,ANORM,RCOND[,NORM][,INFO])
GBCON(A,IPIV,ANORM,RCOND[,KL][,NORM][,INFO])
GTCON(DL,D,DU,DU2,IPIV,ANORM,RCOND[,NORM][,INFO])
POCON(A,ANORM,RCOND[,UPLO][,INFO])
PPCON(A,ANORM,RCOND[,UPLO][,INFO])
PBCON(A,ANORM,RCOND[,UPLO][,INFO])
PTCON(D,E,ANORM,RCOND[,INFO])
SYCON(A,IPIV,ANORM,RCOND[,UPLO][,INFO])
HECON(A,IPIV,ANORM,RCOND[,UPLO][,INFO])
SPCON(A,IPIV,ANORM,RCOND[,UPLO][,INFO])
HPCON(A,IPIV,ANORM,RCOND[,UPLO][,INFO])
TRCON(A,RCOND[,UPLO][,DIAG][,NORM][,INFO])
TPCON(A,RCOND[,UPLO][,DIAG][,NORM][,INFO])
TBCON(A,RCOND[,UPLO][,DIAG][,NORM][,INFO])
GBRFS(A,AF,IPIV,B,X[,KL][,TRANS][,FERR][,BERR][,INFO])
GTRFS(DL,D,DU,DLF,DF,DUF,DU2,IPIV,B,X[,TRANS][,FERR][,BERR][,INFO]) PORFS(A,AF,B,X[,UPLO][,FERR][,BERR][,INFO]) PPRFS(A,AF,B,X[,UPLO][,FERR][,BERR][,INFO]) PBRFS(A,AF,B,X[,UPLO][,FERR][,BERR][,INFO]) PTRFS(D,DF,E,EF,B,X[,FERR][,BERR][,INFO]) PTRFS(D,DF,E,EF,B,X[,UPLO][,FERR][,BERR][,INFO]) SYRFS(A,AF,IPIV,B,X[,UPLO][,FERR][,BERR][,INFO]) HERFS(A,AF,IPIV,B,X[,UPLO][,FERR][,BERR][,INFO]) SPRFS(A,AF,IPIV,B,X[,UPLO][,FERR][,BERR][,INFO]) HPRFS(A,AF,IPIV,B,X[,UPLO][,FERR][,BERR][,INFO]) TRRFS(A,B,X[,UPLO][,TRANS][,DIAG][,FERR][,BERR][,INFO]) TPRFS(A,B,X[,UPLO][,TRANS][,DIAG][,FERR][,BERR][,INFO]) TBRFS(A,B,X[,UPLO][,TRANS][,DIAG][,FERR][,BERR][,INFO]) POTRI(A[,UPLO][,INFO]) PPTRI(A[,UPLO][,INFO]) SYTRI(A,IPIV[,UPLO][,INFO]) HETRI(A,IPIV[,UPLO][,INFO]) SPTRI(A,IPIV[,UPLO][,INFO]) HPTRI(A,IPIV[,UPLO][,INFO]) TRTRI(A[,UPLO][,DIAG][,INFO]) TPTRI(A[,UPLO][,DIAG][,INFO]) GBEQU(A,R,C[,KL][,ROWCND][,COLCND][,AMAX][,INFO]) POEQU(A,S[,SCOND][,AMAX][,INFO]) PPEQU(A,S[,SCOND][,AMAX][,UPLO][,INFO]) PBEQU(A,S[,SCOND][,AMAX][,UPLO][,INFO]) HESV(A,B[,UPLO][,IPIV][,INFO]) HPSV(A,B[,UPLO][,IPIV][,INFO]) GEQRF(A[,TAU][,INFO]) GEQPF(A,JPVT[,TAU][,INFO]) GEQP3(A,JPVT[,TAU][,INFO]) ORGQR(A,TAU[,INFO]) ORMQR(A,TAU,C[,SIDE][,TRANS][,INFO]) UNGQR(A,TAU[,INFO]) UNMQR(A,TAU,C[,SIDE][,TRANS][,INFO]) GELQF(A[,TAU][,INFO]) ORGLQ(A,TAU[,INFO]) ORMLQ(A,TAU,C[,SIDE][,TRANS][,INFO]) UNGLQ(A,TAU[,INFO]) UNMLQ(A,TAU,C[,SIDE][,TRANS][,INFO]) GEQLF(A[,TAU][,INFO]) Specific Features of Fortran 95 Interfaces for LAPACK Routines E 2685 ORGQL(A,TAU[,INFO]) UNGQL(A,TAU[,INFO]) ORMQL(A,TAU,C[,SIDE][,TRANS][,INFO]) UNMQL(A,TAU,C[,SIDE][,TRANS][,INFO]) GERQF(A[,TAU][,INFO]) ORGRQ(A,TAU[,INFO]) UNGRQ(A,TAU[,INFO]) ORMRQ(A,TAU,C[,SIDE][,TRANS][,INFO]) UNMRQ(A,TAU,C[,SIDE][,TRANS][,INFO]) TZRZF(A[,TAU][,INFO]) ORMRZ(A,TAU,C,L[,SIDE][,TRANS][,INFO]) UNMRZ(A,TAU,C,L[,SIDE][,TRANS][,INFO]) GGQRF(A,B[,TAUA][,TAUB][,INFO]) GGRQF(A,B[,TAUA][,TAUB][,INFO]) GEBRD(A[,D][,E][,TAUQ][,TAUP][,INFO]) 
GBBRD(A[,C][,D][,E][,Q][,PT][,KL][,M][,INFO])
ORGBR(A,TAU[,VECT][,INFO])
ORMBR(A,TAU,C[,VECT][,SIDE][,TRANS][,INFO])
ORMTR(A,TAU,C[,SIDE][,UPLO][,TRANS][,INFO])
UNGBR(A,TAU[,VECT][,INFO])
UNMBR(A,TAU,C[,VECT][,SIDE][,TRANS][,INFO])
BDSQR(D,E[,VT][,U][,C][,UPLO][,INFO])
BDSDC(D,E[,U][,VT][,Q][,IQ][,UPLO][,INFO])
UNMTR(A,TAU,C[,SIDE][,UPLO][,TRANS][,INFO])
SPTRD(A,TAU[,UPLO][,INFO])
OPGTR(A,TAU,Q[,UPLO][,INFO])
OPMTR(A,TAU,C[,SIDE][,UPLO][,TRANS][,INFO])
HPTRD(A,TAU[,UPLO][,INFO])
UPGTR(A,TAU,Q[,UPLO][,INFO])
UPMTR(A,TAU,C[,SIDE][,UPLO][,TRANS][,INFO])
SBTRD(A[,Q][,VECT][,UPLO][,INFO])
HBTRD(A[,Q][,VECT][,UPLO][,INFO])
STERF(D,E[,INFO])
STEQR(D,E[,Z][,COMPZ][,INFO])
STEDC(D,E[,Z][,COMPZ][,INFO])
STEGR(D,E,W[,Z][,VL][,VU][,IL][,IU][,M][,ISUPPZ][,ABSTOL][,INFO])
PTEQR(D,E[,Z][,COMPZ][,INFO])
STEBZ(D,E,M,NSPLIT,W,IBLOCK,ISPLIT[,ORDER][,VL][,VU][,IL][,IU][,ABSTOL][,INFO])
STEIN(D,E,W,IBLOCK,ISPLIT,Z[,IFAILV][,INFO])
DISNA(D,SEP[,JOB][,MINMN][,INFO])
SPGST(A,B[,ITYPE][,UPLO][,INFO])
HPGST(A,B[,ITYPE][,UPLO][,INFO])
SBGST(A,B[,X][,UPLO][,INFO])
HBGST(A,B[,X][,UPLO][,INFO])
PBSTF(B[,UPLO][,INFO])
GEHRD(A[,TAU][,ILO][,IHI][,INFO])
ORGHR(A,TAU[,ILO][,IHI][,INFO])
ORMHR(A,TAU,C[,ILO][,IHI][,SIDE][,TRANS][,INFO])
UNGHR(A,TAU[,ILO][,IHI][,INFO])
UNMHR(A,TAU,C[,ILO][,IHI][,SIDE][,TRANS][,INFO])
GEBAL(A[,SCALE][,ILO][,IHI][,JOB][,INFO])
GEBAK(V,SCALE[,ILO][,IHI][,JOB][,SIDE][,INFO])
HSEQR(H,WR,WI[,ILO][,IHI][,Z][,JOB][,COMPZ][,INFO])
HSEQR(H,W[,ILO][,IHI][,Z][,JOB][,COMPZ][,INFO])
HSEIN(H,WR,WI,SELECT[,VL][,VR][,IFAILL][,IFAILR][,INITV][,EIGSRC][,M][,INFO])
HSEIN(H,W,SELECT[,VL][,VR][,IFAILL][,IFAILR][,INITV][,EIGSRC][,M][,INFO])
TREVC(T[,HOWMNY][,SELECT][,VL][,VR][,M][,INFO])
TRSNA(T[,S][,SEP][,VL][,VR][,SELECT][,M][,INFO])
TREXC(T,IFST,ILST[,Q][,INFO])
TRSEN(T,SELECT[,WR][,WI][,M][,S][,SEP][,Q][,INFO])
TRSEN(T,SELECT[,W][,M][,S][,SEP][,Q][,INFO])
TRSYL(A,B,C,SCALE[,TRANA][,TRANB][,ISGN][,INFO])
GGHRD(A,B[,ILO][,IHI][,Q][,Z][,COMPQ][,COMPZ][,INFO])
GGBAL(A,B[,ILO][,IHI][,LSCALE][,RSCALE][,JOB][,INFO])
GGBAK(V[,ILO][,IHI][,LSCALE][,RSCALE][,JOB][,INFO])
HGEQZ(H,T[,ILO][,IHI][,ALPHAR][,ALPHAI][,BETA][,Q][,Z][,JOB][,COMPQ][,COMPZ][,INFO])
HGEQZ(H,T[,ILO][,IHI][,ALPHA][,BETA][,Q][,Z][,JOB][,COMPQ][,COMPZ][,INFO])
TGEVC(S,P[,HOWMNY][,SELECT][,VL][,VR][,M][,INFO])
TGEXC(A,B[,IFST][,ILST][,Z][,Q][,INFO])
TGSEN(A,B,SELECT[,ALPHAR][,ALPHAI][,BETA][,IJOB][,Q][,Z][,PL][,PR][,DIF][,M][,INFO])
TGSEN(A,B,SELECT[,ALPHA][,BETA][,IJOB][,Q][,Z][,PL][,PR][,DIF][,M][,INFO])
TGSYL(A,B,C,D,E,F[,IJOB][,TRANS][,SCALE][,DIF][,INFO])
TGSNA(A,B[,S][,DIF][,VL][,VR][,SELECT][,M][,INFO])
GGSVP(A,B,TOLA,TOLB[,K][,L][,U][,V][,Q][,INFO])
TGSJA(A,B,TOLA,TOLB,K,L[,U][,V][,Q][,JOBU][,JOBV][,JOBQ][,ALPHA][,BETA][,NCYCLE][,INFO])

Interfaces of New Functionality

GETRF(A[,IPIV][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,IPIV,RCOND,NORM,INFO)
! Different number for parameter, netlib: 5, mkl: 3
! Different order for parameter INFO, netlib: 5, mkl: 3
! Absent mkl parameter: NORM
! Absent mkl parameter: RCOND
GBTRF(A[,KL][,M][,IPIV][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,K,M,IPIV,RCOND,NORM,INFO)
! Different number for parameter, netlib: 7, mkl: 5
! Different order for parameter INFO, netlib: 7, mkl: 5
! Absent mkl parameter: NORM
! Replace parameter name: netlib: K: mkl: KL
! Absent mkl parameter: RCOND
POTRF(A[,UPLO][,INFO])
! Interface netlib95 exists, parameters:
! netlib: (A,UPLO,RCOND,NORM,INFO)
! Different number for parameter, netlib: 5, mkl: 3
! Different order for parameter INFO, netlib: 5, mkl: 3
! Absent mkl parameter: NORM
! Absent mkl parameter: RCOND

FFTW Interface to Intel® Math Kernel Library (Appendix F)

Intel® Math Kernel Library (Intel® MKL) offers FFTW2 and FFTW3 interfaces to Intel MKL Fast Fourier Transform and Trigonometric Transform functionality. The purpose of these interfaces is to enable applications using FFTW (www.fftw.org) to gain performance with Intel MKL without changing the program source code.

Both FFTW2 and FFTW3 interfaces are provided in open source as FFTW wrappers to Intel MKL. For ease of use, the FFTW3 interface is also integrated in Intel MKL.

Notational Conventions

This appendix typically employs path notations for Windows* OS.

FFTW2 Interface to Intel® Math Kernel Library

This section describes a collection of wrappers providing the FFTW 2.x interface to Intel MKL. The wrappers translate calls to FFTW 2.x functions into calls of the Intel MKL Fast Fourier Transform interface (FFT interface).

The wrappers correspond to FFTW version 2.x and Intel MKL versions 7.0 or higher.

Because of differences between FFTW and Intel MKL FFT functionalities, there are restrictions on using wrappers instead of the FFTW functions. Some FFTW functions have empty wrappers. However, many typical FFTs can be computed using these wrappers.

Refer to chapter 11, "Fourier Transform Functions", to better understand the effects of using the wrappers.

More wrappers may be added in the future to extend FFTW functionality available with Intel MKL.

Wrappers Reference

This section provides a brief reference for the FFTW 2.x C interface. For details, refer to the original FFTW 2.x documentation available at www.fftw.org.

Each FFTW function has its own wrapper. Some of them, which are not expressly listed in this section, are empty and do nothing, but they are provided to avoid link errors and satisfy the function calls.
The Intel MKL FFT interface operates on both float and double-precision data types.

One-dimensional Complex-to-complex FFTs

The following functions compute a one-dimensional complex-to-complex Fast Fourier transform.

fftw_plan fftw_create_plan(int n, fftw_direction dir, int flags);
fftw_plan fftw_create_plan_specific(int n, fftw_direction dir, int flags, fftw_complex *in, int istride, fftw_complex *out, int ostride);
void fftw(fftw_plan plan, int howmany, fftw_complex *in, int istride, int idist, fftw_complex *out, int ostride, int odist);
void fftw_one(fftw_plan plan, fftw_complex *in, fftw_complex *out);
void fftw_destroy_plan(fftw_plan plan);

Multi-dimensional Complex-to-complex FFTs

The following functions compute a multi-dimensional complex-to-complex Fast Fourier transform.

fftwnd_plan fftwnd_create_plan(int rank, const int *n, fftw_direction dir, int flags);
fftwnd_plan fftw2d_create_plan(int nx, int ny, fftw_direction dir, int flags);
fftwnd_plan fftw3d_create_plan(int nx, int ny, int nz, fftw_direction dir, int flags);
fftwnd_plan fftwnd_create_plan_specific(int rank, const int *n, fftw_direction dir, int flags, fftw_complex *in, int istride, fftw_complex *out, int ostride);
fftwnd_plan fftw2d_create_plan_specific(int nx, int ny, fftw_direction dir, int flags, fftw_complex *in, int istride, fftw_complex *out, int ostride);
fftwnd_plan fftw3d_create_plan_specific(int nx, int ny, int nz, fftw_direction dir, int flags, fftw_complex *in, int istride, fftw_complex *out, int ostride);
void fftwnd(fftwnd_plan plan, int howmany, fftw_complex *in, int istride, int idist, fftw_complex *out, int ostride, int odist);
void fftwnd_one(fftwnd_plan plan, fftw_complex *in, fftw_complex *out);
void fftwnd_destroy_plan(fftwnd_plan plan);

One-dimensional Real-to-half-complex/Half-complex-to-real FFTs

Half-complex representation of a conjugate-even symmetric vector of size N in a real array of the same size N consists of N/2+1 real parts of the elements of the vector
followed by non-zero imaginary parts in the reverse order. Because the Intel MKL FFT interface does not currently support this representation, all wrappers of this kind are empty and do nothing.

Nevertheless, you can perform one-dimensional real-to-complex and complex-to-real transforms using rfftwnd functions with rank=1.

See Also
Multi-dimensional Real-to-complex/Complex-to-real FFTs

Multi-dimensional Real-to-complex/Complex-to-real FFTs

The following functions compute multi-dimensional real-to-complex and complex-to-real Fast Fourier transforms.

rfftwnd_plan rfftwnd_create_plan(int rank, const int *n, fftw_direction dir, int flags);
rfftwnd_plan rfftw2d_create_plan(int nx, int ny, fftw_direction dir, int flags);
rfftwnd_plan rfftw3d_create_plan(int nx, int ny, int nz, fftw_direction dir, int flags);
rfftwnd_plan rfftwnd_create_plan_specific(int rank, const int *n, fftw_direction dir, int flags, fftw_real *in, int istride, fftw_real *out, int ostride);
rfftwnd_plan rfftw2d_create_plan_specific(int nx, int ny, fftw_direction dir, int flags, fftw_real *in, int istride, fftw_real *out, int ostride);
rfftwnd_plan rfftw3d_create_plan_specific(int nx, int ny, int nz, fftw_direction dir, int flags, fftw_real *in, int istride, fftw_real *out, int ostride);
void rfftwnd_real_to_complex(rfftwnd_plan plan, int howmany, fftw_real *in, int istride, int idist, fftw_complex *out, int ostride, int odist);
void rfftwnd_complex_to_real(rfftwnd_plan plan, int howmany, fftw_complex *in, int istride, int idist, fftw_real *out, int ostride, int odist);
void rfftwnd_one_real_to_complex(rfftwnd_plan plan, fftw_real *in, fftw_complex *out);
void rfftwnd_one_complex_to_real(rfftwnd_plan plan, fftw_complex *in, fftw_real *out);
void rfftwnd_destroy_plan(rfftwnd_plan plan);

Multi-threaded FFTW

This section discusses multi-threaded FFTW wrappers only.
MPI FFTW wrappers, available only with Intel MKL for the Linux* and Windows* operating systems, are described in section "MPI FFTW Wrappers".

Unlike the original FFTW interface, every computational function in the FFTW2 interface to Intel MKL provides multithreaded computation by default, with the number of threads defined by the number of processors available on the system (see section "Managing Performance and Memory" in the Intel MKL User's Guide). To limit the number of threads that use the FFTW interface, call the threaded FFTW computational functions:

void fftw_threads(int nthreads, fftw_plan plan, int howmany, fftw_complex *in, int istride, int idist, fftw_complex *out, int ostride, int odist);
void fftw_threads_one(int nthreads, fftw_plan plan, fftw_complex *in, fftw_complex *out);
...
void rfftwnd_threads_real_to_complex(int nthreads, rfftwnd_plan plan, int howmany, fftw_real *in, int istride, int idist, fftw_complex *out, int ostride, int odist);

Compared to its non-threaded counterpart, every threaded computational function has threads_ as the second part of its name and an additional first parameter, nthreads. Set the nthreads parameter to the thread limit to ensure that the computation uses at most that number of threads.

FFTW Support Functions

The FFTW wrappers provide memory allocation functions to be used with FFTW:

void* fftw_malloc(size_t n);
void fftw_free(void* x);

The fftw_malloc wrapper aligns the memory on a 16-byte boundary.

If fftw_malloc fails to allocate memory, it aborts the application. To override this behavior, set a global variable fftw_malloc_hook and optionally the complementary variable fftw_free_hook:

void *(*fftw_malloc_hook) (size_t n);
void (*fftw_free_hook) (void *p);

The wrappers use the function fftw_die to abort the application in cases when a caller cannot be informed of an error otherwise (for example, in computational functions that return void).
To override this behavior, set a global variable fftw_die_hook:

void (*fftw_die_hook)(const char *error_string);
void fftw_die(const char *s);

Limitations of the FFTW2 Interface to Intel MKL

The FFTW2 wrappers implement the functionality of only those FFTW functions that Intel MKL can reasonably support. Other functions are provided as no-operation functions, whose only purpose is to satisfy link-time symbol resolution. Specifically, no-operation functions include:
• Real-to-half-complex and respective backward transforms
• Print plan functions
• Functions for importing/exporting/forgetting wisdom
• Most of the FFTW functions not covered by the original FFTW2 documentation

Because the Intel MKL implementation of FFTW2 wrappers does not use plan and plan node structures declared in fftw.h, the behavior of an application that relies on the internals of the plan structures defined in that header file is undefined.

FFTW2 wrappers define plan as a set of attributes, such as strides, used to commit the Intel MKL FFT descriptor structure. If an FFTW2 computational function is called with attributes different from those recorded in the plan, the function attempts to adjust the attributes of the plan and recommit the descriptor. Thus, repeated calls of a computational function with the same plan but different strides, distances, and other parameters may be inefficient.

Plan creation functions disregard most planner flags passed through the flags parameter. These functions take into account only the following values of flags:
• FFTW_IN_PLACE
If this value of flags is supplied, the plan is marked so that computational functions using that plan ignore the parameters related to output (out, ostride, and odist). Unlike the original FFTW interface, the wrappers never use the out parameter as a scratch space for in-place transforms.
• FFTW_THREADSAFE
If this value of flags is supplied, the plan is marked read-only.
An attempt to change attributes of a read-only plan aborts the application.

FFTW wrappers are generally not thread safe. Therefore, do not use the same plan in parallel user threads simultaneously.

Calling Wrappers from Fortran

The FFTW2 wrappers to Intel MKL provide the following subroutines for calling from Fortran:

call fftw_f77_create_plan(plan, n, dir, flags)
call fftw_f77(plan, howmany, in, istride, idist, out, ostride, odist)
call fftw_f77_one(plan, in, out)
call fftw_f77_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
call fftw_f77_threads_one(nthreads, plan, in, out)
call fftw_f77_destroy_plan(plan)
call fftwnd_f77_create_plan(plan, rank, n, dir, flags)
call fftw2d_f77_create_plan(plan, nx, ny, dir, flags)
call fftw3d_f77_create_plan(plan, nx, ny, nz, dir, flags)
call fftwnd_f77(plan, howmany, in, istride, idist, out, ostride, odist)
call fftwnd_f77_one(plan, in, out)
call fftwnd_f77_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
call fftwnd_f77_threads_one(nthreads, plan, in, out)
call fftwnd_f77_destroy_plan(plan)
call rfftw_f77_create_plan(plan, n, dir, flags)
call rfftw_f77(plan, howmany, in, istride, idist, out, ostride, odist)
call rfftw_f77_one(plan, in, out)
call rfftw_f77_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
call rfftw_f77_threads_one(nthreads, plan, in, out)
call rfftw_f77_destroy_plan(plan)
call rfftwnd_f77_create_plan(plan, rank, n, dir, flags)
call rfftw2d_f77_create_plan(plan, nx, ny, dir, flags)
call rfftw3d_f77_create_plan(plan, nx, ny, nz, dir, flags)
call rfftwnd_f77_complex_to_real(plan, howmany, in, istride, idist, out, ostride, odist)
call rfftwnd_f77_one_complex_to_real(plan, in, out)
call rfftwnd_f77_real_to_complex(plan, howmany, in, istride, idist, out, ostride, odist)
call rfftwnd_f77_one_real_to_complex(plan, in, out)
call rfftwnd_f77_threads_complex_to_real(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
call rfftwnd_f77_threads_one_complex_to_real(nthreads, plan, in, out)
call rfftwnd_f77_threads_real_to_complex(nthreads, plan, howmany, in, istride, idist, out, ostride, odist)
call rfftwnd_f77_threads_one_real_to_complex(nthreads, plan, in, out)
call rfftwnd_f77_destroy_plan(plan)
call fftw_f77_threads_init(info)

The FFTW Fortran functions are wrappers to the FFTW C functions, so their functionality and limitations are the same as those of the corresponding C wrappers.

See Also
Wrappers Reference
Limitations of the FFTW2 Interface to Intel MKL

Installation

Wrappers are delivered as source code, which you must compile to build the wrapper library. Then you can substitute the wrapper and Intel MKL libraries for the FFTW library. The source code for the wrappers and makefiles with the wrapper list files are located in the .\interfaces\fftw2xc and .\interfaces\fftw2xf subdirectories in the Intel MKL directory for C and Fortran wrappers, respectively.

Creating the Wrapper Library

Two header files are used to compile the C wrapper library: fftw2_mkl.h and fftw.h. The fftw2_mkl.h file is located in the .\interfaces\fftw2xc\wrappers subdirectory in the Intel MKL directory.

Three header files are used to compile the Fortran wrapper library: fftw2_mkl.h, fftw2_f77_mkl.h, and fftw.h. The fftw2_mkl.h and fftw2_f77_mkl.h files are located in the .\interfaces\fftw2xf\wrappers subdirectory in the Intel MKL directory.

The file fftw.h, used to compile libraries for both interfaces and located in the .\include\fftw subdirectory in the Intel MKL directory, slightly differs from the original FFTW (www.fftw.org) header file fftw.h.
A wrapper library contains C or Fortran wrappers for complex and real transforms in serial and multithreaded modes for one of the two data types (double or float). A makefile parameter selects the data type.

The makefile parameters specify the platform (required), compiler, and data precision. The makefile comment heading provides the exact description of these parameters.

Because a C compiler builds the Fortran wrapper library, function names in the wrapper library and Fortran object module may be different. The file fftw2_f77_mkl.h in the .\interfaces\fftw2xf\source subdirectory in the Intel MKL directory defines function names according to the names in the Fortran module. If a required name is missing in the file, you can modify the file to add the name before building the library.

To build the library, run the make command on Linux* OS and Mac OS* X or the nmake command on Windows* OS with appropriate parameters.

For example, the command

make libintel64

builds on Linux OS a double-precision wrapper library for Intel® 64 architecture based applications using the Intel® C++ Compiler or the Intel® Fortran Compiler version 9.1 or higher (compilers and data precision are chosen by default).

Each makefile creates the library in the directory with the Intel MKL libraries corresponding to the platform used. For example, ./lib/ia32 (on Linux OS and Mac OS X) or .\lib\ia32 (on Windows* OS).

In the wrapper library names, the suffix corresponds to the compiler used, the letter "f" precedes the underscore for Fortran, and the letter "c" precedes the underscore for C. For example:
fftw2xf_intel.lib (on Windows OS); libfftw2xf_intel.a (on Linux OS and Mac OS X);
fftw2xc_intel.lib (on Windows OS); libfftw2xc_intel.a (on Linux OS and Mac OS X);
fftw2xc_ms.lib (on Windows OS); libfftw2xc_gnu.a (on Linux OS and Mac OS X).
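The naming issue described in this section (a C compiler builds the Fortran wrappers, so each exported symbol has to match the name the Fortran compiler emits) can be illustrated with a small C sketch. Everything named DEMO_* or demo_* below is hypothetical and chosen only for illustration; the sketch does not reproduce the contents of fftw2_f77_mkl.h, and the two decoration styles shown are merely common examples, not a list of what any particular compiler does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical decoration styles a Fortran compiler might apply to external names. */
#define DEMO_LOWER_UNDERSCORE 1   /* sub -> sub_ */
#define DEMO_UPPER            2   /* sub -> SUB  */

#ifndef DEMO_DECORATION
#define DEMO_DECORATION DEMO_LOWER_UNDERSCORE
#endif

/* Map one logical routine name to the decorated symbol name. */
#if DEMO_DECORATION == DEMO_LOWER_UNDERSCORE
#define DEMO_F77_NAME(lower, UPPER) lower##_
#else
#define DEMO_F77_NAME(lower, UPPER) UPPER
#endif

/* Two-level stringification so the macro argument is expanded first. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* The C wrapper is defined under the decorated name.
   Fortran passes arguments by reference, hence the pointers. */
void DEMO_F77_NAME(demo_f77_one, DEMO_F77_ONE)(const int *n, const double *in, double *out)
{
    assert(n && in && out);
    for (int i = 0; i < *n; ++i)
        out[i] = 2.0 * in[i];   /* placeholder body, not an FFT */
}

/* Returns the symbol name that the C compiler actually defines. */
const char *demo_exported_name(void)
{
    return DEMO_STR(DEMO_F77_NAME(demo_f77_one, DEMO_F77_ONE));
}

/* Reports whether the chosen decoration appends an underscore. */
int demo_name_has_trailing_underscore(void)
{
    const char *s = demo_exported_name();
    size_t len = strlen(s);
    return len > 0 && s[len - 1] == '_';
}
```

Compiling the same file with -DDEMO_DECORATION=2 would export DEMO_F77_ONE instead, which is the kind of adjustment the text suggests making in the header when a required name is missing.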
Application Assembling

Use the necessary original FFTW (www.fftw.org) header files without any modifications. Use the created wrapper library and the Intel MKL library instead of the FFTW library.

Running Examples

Intel MKL provides examples to demonstrate how to use the wrapper library. The source code for the examples, makefiles used to run them, and the example list files are located in the .\examples\fftw2xc and .\examples\fftw2xf subdirectories in the Intel MKL directory for C and Fortran, respectively. To build examples, several additional files are needed: fftw.h, fftw_threads.h, rfftw.h, rfftw_threads.h, and fftw_f77.i. These files are distributed with permission from FFTW and are available in .\include\fftw. The original files can also be found in FFTW 2.1.5 at http://www.fftw.org/download.html.

An example makefile uses the function parameter in addition to the parameters that the respective wrapper library makefile uses (see Creating the Wrapper Library). The makefile comment heading provides the exact description of these parameters.

An example makefile normally runs the examples. However, if the appropriate wrapper library is not yet created, the makefile first builds the library the same way as the wrapper library makefile does and then proceeds to the examples.

If the parameter function= is defined, only the specified example runs. Otherwise, all examples from the appropriate subdirectory run. The subdirectory .\_results is created, and the results are stored there in the .res files.

MPI FFTW Wrappers

MPI FFTW wrappers for FFTW 2 are available only with Intel® MKL for the Linux* and Windows* operating systems.

MPI FFTW Wrappers Reference

This section provides a reference for the MPI FFTW C interface.
Complex MPI FFTW

Complex One-dimensional MPI FFTW Transforms

fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n, fftw_direction dir, int flags);
void fftw_mpi(fftw_mpi_plan p, int n_fields, fftw_complex *local_data, fftw_complex *work);
void fftw_mpi_local_sizes(fftw_mpi_plan p, int *local_n, int *local_start, int *local_n_after_transform, int *local_start_after_transform, int *total_local_size);
void fftw_mpi_destroy_plan(fftw_mpi_plan plan);

Argument restrictions:
• Supported values of flags are FFTW_ESTIMATE, FFTW_MEASURE, FFTW_SCRAMBLED_INPUT and FFTW_SCRAMBLED_OUTPUT. The same algorithm corresponds to all these values of the flags parameter. If any other flags value is supplied, the wrapper library reports an error 'CDFT error in wrapper: unknown flags'.
• The only supported value of n_fields is 1.

Complex Multi-dimensional MPI FFTW Transforms

fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm, int nx, int ny, fftw_direction dir, int flags);
fftwnd_mpi_plan fftw3d_mpi_create_plan(MPI_Comm comm, int nx, int ny, int nz, fftw_direction dir, int flags);
fftwnd_mpi_plan fftwnd_mpi_create_plan(MPI_Comm comm, int dim, int *n, fftw_direction dir, int flags);
void fftwnd_mpi(fftwnd_mpi_plan p, int n_fields, fftw_complex *local_data, fftw_complex *work, fftwnd_mpi_output_order output_order);
void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p, int *local_nx, int *local_x_start, int *local_ny_after_transpose, int *local_y_start_after_transpose, int *total_local_size);
void fftwnd_mpi_destroy_plan(fftwnd_mpi_plan plan);

Argument restrictions:
• Supported values of flags are FFTW_ESTIMATE and FFTW_MEASURE. If any other value of flags is supplied, the wrapper library reports an error 'CDFT error in wrapper: unknown flags'.
• The only supported value of n_fields is 1.
Real MPI FFTW

Real-to-Complex MPI FFTW Transforms

rfftwnd_mpi_plan rfftw2d_mpi_create_plan(MPI_Comm comm, int nx, int ny, fftw_direction dir, int flags);
rfftwnd_mpi_plan rfftw3d_mpi_create_plan(MPI_Comm comm, int nx, int ny, int nz, fftw_direction dir, int flags);
rfftwnd_mpi_plan rfftwnd_mpi_create_plan(MPI_Comm comm, int dim, int *n, fftw_direction dir, int flags);
void rfftwnd_mpi(rfftwnd_mpi_plan p, int n_fields, fftw_real *local_data, fftw_real *work, fftwnd_mpi_output_order output_order);
void rfftwnd_mpi_local_sizes(rfftwnd_mpi_plan p, int *local_nx, int *local_x_start, int *local_ny_after_transpose, int *local_y_start_after_transpose, int *total_local_size);
void rfftwnd_mpi_destroy_plan(rfftwnd_mpi_plan plan);

Argument restrictions:
• Supported values of flags are FFTW_ESTIMATE and FFTW_MEASURE. If any other value of flags is supplied, the wrapper library reports an error 'CDFT error in wrapper: unknown flags'.
• The only supported value of n_fields is 1.
• Function rfftwnd_mpi_create_plan can be used for both one-dimensional and multi-dimensional transforms.
• Both values of the output_order parameter are supported: FFTW_NORMAL_ORDER and FFTW_TRANSPOSED_ORDER.

Creating MPI FFTW Wrapper Library

The source code for the wrappers, makefile, and wrapper list file are located in the .\interfaces\fftw2x_cdft subdirectory in the Intel MKL directory.

A wrapper library contains C wrappers for Complex One-dimensional MPI FFTW Transforms and Complex Multi-dimensional MPI FFTW Transforms. The library also contains empty C wrappers for Real Multidimensional MPI FFTW Transforms. For details, see MPI FFTW Wrappers Reference.

The makefile parameters specify the platform (required), compiler, and data precision. The makefile comment heading provides the exact description of these parameters.
To build the library, run the make command on Linux* OS and Mac OS* X or the nmake command on Windows* OS with appropriate parameters.

For example, the command

make libintel64

builds on Linux OS a double-precision wrapper library for Intel® 64 architecture based applications using Intel MPI 2.0 and the Intel® C++ Compiler version 9.1 or higher (compilers and data precision are chosen by default).

The makefile creates the wrapper library in the directory with the Intel MKL libraries corresponding to the platform used. For example, ./lib/ia32 (on Linux OS) or .\lib\ia32 (on Windows* OS).

In the wrapper library names, the suffix corresponds to the data precision used. For example, fftw2x_cdft_SINGLE.lib on Windows OS; libfftw2x_cdft_DOUBLE.a on Linux OS.

Application Assembling with MPI FFTW Wrapper Library

Use the necessary original FFTW (www.fftw.org) header files without any modifications. Use the created MPI FFTW wrapper library and the Intel MKL library instead of the FFTW library.

Running Examples

There are some examples that demonstrate how to use the MPI FFTW wrapper library for FFTW2. The source C code for the examples, makefiles used to run them, and the example list files are located in the .\examples\fftw2x_cdft subdirectory in the Intel MKL directory. To build examples, one additional file fftw_mpi.h is needed. This file is distributed with permission from FFTW and is available in .\include\fftw. The original file can also be found in FFTW 2.1.5 at http://www.fftw.org/download.html.

Parameters for the example makefiles are described in the makefile comment headings and are similar to the wrapper library makefile parameters (see Creating MPI FFTW Wrapper Library).

The table below lists examples available in the .\examples\fftw2x_cdft\source subdirectory.
Examples of MPI FFTW Wrappers

Source file for the example    Description
wrappers_c1d.c                 One-dimensional Complex MPI FFTW transform, using plan = fftw_mpi_create_plan(...)
wrappers_c2d.c                 Two-dimensional Complex MPI FFTW transform, using plan = fftw2d_mpi_create_plan(...)
wrappers_c3d.c                 Three-dimensional Complex MPI FFTW transform, using plan = fftw3d_mpi_create_plan(...)
wrappers_c4d.c                 Four-dimensional Complex MPI FFTW transform, using plan = fftwnd_mpi_create_plan(...)
wrappers_r1d.c                 One-dimensional Real MPI FFTW transform, using plan = rfftw_mpi_create_plan(...)
wrappers_r2d.c                 Two-dimensional Real MPI FFTW transform, using plan = rfftw2d_mpi_create_plan(...)
wrappers_r3d.c                 Three-dimensional Real MPI FFTW transform, using plan = rfftw3d_mpi_create_plan(...)
wrappers_r4d.c                 Four-dimensional Real MPI FFTW transform, using plan = rfftwnd_mpi_create_plan(...)

FFTW3 Interface to Intel® Math Kernel Library

This section describes a collection of FFTW3 wrappers to Intel MKL. The wrappers translate calls of FFTW3 functions to the calls of the Intel MKL Fourier transform (FFT) or Trigonometric Transform (TT) functions. The purpose of FFTW3 wrappers is to enable developers whose programs currently use the FFTW3 library to gain performance with the Intel MKL Fourier transforms without changing the program source code.

The wrappers correspond to FFTW release 3.2 and Intel MKL releases starting with 10.2. For a detailed description of the FFTW interface, refer to www.fftw.org. For a detailed description of the Intel MKL FFT and TT functionality that the wrappers use, see chapter 11 and section "Trigonometric Transform Routines" in chapter 13, respectively.

The FFTW3 wrappers provide a limited functionality compared to the original FFTW 3.2 library, because of differences between FFTW and Intel MKL FFT and TT functionality. This section describes limitations of the FFTW3 wrappers and hints for their usage.
Nevertheless, many typical FFT tasks can be performed using the FFTW3 wrappers to Intel MKL. More functionality may be added to the wrappers and Intel MKL in the future to reduce the constraints of the FFTW3 interface to Intel MKL.

The FFTW3 wrappers are integrated in Intel MKL. The only change required to use Intel MKL through the FFTW3 wrappers is to link your application using FFTW3 against Intel MKL.

A reference implementation of the FFTW3 wrappers is also provided in open source. You can find it in the interfaces directory of the Intel MKL distribution. You can use the reference implementation to create your own wrapper library (see Building Your Own Wrapper Library).

Using FFTW3 Wrappers

The FFTW3 wrappers are a set of functions and data structures depending on one another. The wrappers are not designed to provide the interface on a function-per-function basis. Some FFTW3 wrapper functions are empty and do nothing, but they are present to avoid link errors and satisfy function calls.

This manual does not list the declarations of the functions that the FFTW3 wrappers provide (you can find the declarations in the fftw3.h header file). Instead, this section comments on particular limitations of the wrappers and provides usage hints:
• The FFTW3 wrappers do not support long double precision because Intel MKL FFT functions operate only on single- and double-precision floating-point data types (float and double, respectively). Therefore the functions with prefix fftwl_, supporting the long double data type, are not provided.
• The wrappers provide equivalent implementation for double- and single-precision functions (those with prefixes fftw_ and fftwf_, respectively). So, all these comments apply equally to the double- and single-precision functions and, for brevity, refer to functions with prefix fftw_, that is, double-precision functions.
• The FFTW3 interface that the wrappers provide is defined in header files fftw3.h and fftw3.f. These files are borrowed from the FFTW3.2 package and distributed within Intel MKL with permission. Additionally, files fftw3_mkl.h, fftw3_mkl.f, and fftw3_mkl_f77.h define supporting structures, supplementary constants and macros, and expose the Fortran interface in C.
• Actual functionality of the plan creation wrappers is implemented in the guru64 set of functions. Basic interface, advanced interface, and guru interface plan creation functions call the guru64 interface functions. Thus, all types of the FFTW3 plan creation interface in the wrappers are functional.
• Plan creation functions may return a NULL plan, indicating that the functionality is not supported. So, carefully check the result returned by plan creation functions in your application. In particular, the following problems return a NULL plan:
– c2r and r2c problems with a split storage of complex data.
– r2r problems with kind values FFTW_R2HC, FFTW_HC2R, and FFTW_DHT. The only supported r2r kinds are even/odd DFTs (sine/cosine transforms).
– Multidimensional r2r transforms.
– Transforms of multidimensional vectors. That is, the only supported values for parameter howmany_rank in guru and guru64 plan creation functions are 0 and 1.
– Multidimensional transforms with rank > MKL_MAXRANK.
• The MKL_RODFT00 value of the kind parameter is introduced by the FFTW3 wrappers. For better performance, you are strongly encouraged to use this value rather than FFTW_RODFT00. To use this kind value, provide an extra first element equal to 0.0 for the input/output vectors. Consider the following example:
plan1 = fftw_plan_r2r_1d(n, in1, out1, FFTW_RODFT00, FFTW_ESTIMATE);
plan2 = fftw_plan_r2r_1d(n, in2, out2, MKL_RODFT00, FFTW_ESTIMATE);
Both plans perform the same transform, except that the in2/out2 arrays have one extra zero element at location 0.
  For example, if n=3, in1={x,y,z} and out1={u,v,w}, then in2={0,x,y,z} and out2={0,u,v,w}.
• The flags parameter in plan creation functions is always ignored. The same algorithm is used regardless of the value of this parameter. In particular, flags values FFTW_ESTIMATE, FFTW_MEASURE, etc. have no effect.
• For multithreaded plans, use the normal sequence of calls to the fftw_init_threads() and fftw_plan_with_nthreads() functions (refer to FFTW documentation).
• FFTW3 wrappers are not fully thread safe. If the new-array execute functions, such as fftw_execute_dft(), share the same plan from parallel user threads, set the number of the sharing threads before creation of the plan. For this purpose, the FFTW3 wrappers provide a header file fftw3_mkl.h, which defines a global structure fftw3_mkl with a field to be set to the number of sharing threads. Below is an example of setting the number of sharing threads:
    #include "fftw3.h"
    #include "fftw3_mkl.h"
    fftw3_mkl.number_of_user_threads = 4;
    plan = fftw_plan_dft(...);
• Memory allocation function fftw_malloc returns memory aligned at a 16-byte boundary. You must free the memory with fftw_free.
• The FFTW3 wrappers to Intel MKL use the 32-bit int type in both LP64 and ILP64 interfaces of Intel MKL. Use guru64 FFTW3 interfaces for 64-bit sizes.
• Fortran wrappers (see Calling Wrappers from Fortran) use the INTEGER type, which is 32-bit in LP64 interfaces and 64-bit in ILP64 interfaces.
• The wrappers typically indicate a problem by returning a NULL plan. In a few cases, the wrappers may report a descriptive message of the problem detected. By default the reporting is turned off.
  To turn it on, set variable fftw3_mkl.verbose to a non-zero value, for example:
    #include "fftw3.h"
    #include "fftw3_mkl.h"
    fftw3_mkl.verbose = 1;
    plan = fftw_plan_r2r(...);
• The following functions are empty:
  – For saving, loading, and printing plans
  – For saving and loading wisdom
  – For estimating arithmetic cost of the transforms.
• Do not use macro FFTW_DLL with the FFTW3 wrappers to Intel MKL.
• Do not use negative stride values. Although the advanced and guru parts of the FFTW3 wrapper interface accept negative strides, the underlying implementation does not support them.

Calling Wrappers from Fortran

Intel MKL also provides Fortran 77 interfaces of the FFTW3 wrappers. The Fortran wrappers are available for all FFTW3 interface functions and are based on the C interface of the FFTW3 wrappers. Therefore they have the same functionality and restrictions as the corresponding C interface wrappers.

The Fortran wrappers use the default INTEGER type for integer arguments. The default INTEGER is 32-bit in Intel MKL LP64 interfaces and 64-bit in ILP64 interfaces. Argument plan in a Fortran application must have type INTEGER*8.

The wrappers that are double-precision subroutines have prefix dfftw_; single-precision subroutines have prefix sfftw_ and provide equivalent functionality. Long double subroutines (with prefix lfftw_) are not provided.

The Fortran FFTW3 wrappers use the default Intel® Fortran compiler convention for name decoration. If your compiler uses a different convention, or if you are using compiler options affecting the name decoration (such as /Qlowercase), you may need to compile the wrappers from sources, as described in section Building Your Own Wrapper Library.

For interoperability with C, the declaration of the Fortran FFTW3 interface is provided in header file include/fftw/fftw3_mkl_f77.h.

You can call Fortran wrappers from a FORTRAN 77 or Fortran 90 application, although Intel MKL does not provide a Fortran 90 module for the wrappers.
For a detailed description of the FFTW Fortran interface, refer to FFTW3 documentation (www.fftw.org).

The following example illustrates calling the FFTW3 wrappers from Fortran:
      INTEGER*8 PLAN
      INTEGER N
      INCLUDE 'fftw3.f'
      COMPLEX*16 IN(*), OUT(*)
      !...initialize array IN
      CALL DFFTW_PLAN_DFT_1D(PLAN, N, IN, OUT, -1, FFTW_ESTIMATE)
      IF (PLAN .EQ. 0) STOP
      CALL DFFTW_EXECUTE(PLAN)
      !...result is in array OUT

Building Your Own Wrapper Library

The FFTW3 wrappers to Intel MKL are delivered both integrated in Intel MKL and as source code, which can be compiled to build a standalone wrapper library with exactly the same functionality. Normally you do not need to build the wrappers yourself. However, if your Fortran application is compiled with a compiler that uses a different name decoration than the Intel® Fortran compiler or if you are using compiler options altering the Fortran name decoration, you may need to build the wrappers that use the appropriate name changing convention.

The source code for the wrappers, makefiles, and function list files are located in subdirectories .\interfaces\fftw3xc and .\interfaces\fftw3xf in the Intel MKL directory for C and Fortran wrappers, respectively.

To build the wrappers:
1. Change the current directory to the wrapper directory.
2. Run the make command on Linux* OS and Mac OS* X or the nmake command on Windows* OS with a required target and optionally several parameters.

The target, that is, one of {libia32, libintel64}, defines the platform architecture, and the other parameters facilitate selection of the compiler, size of the default INTEGER type, and placement of the resulting wrapper library. You can find a detailed and up-to-date description of the parameters in the makefile.
In the following example, the make command is used to build the FFTW3 Fortran wrappers to MKL for use from the GNU g77 Fortran compiler on Linux OS based on Intel® 64 architecture:
    cd interfaces/fftw3xf
    make libintel64 compiler=gnu fname=a_name__ install_to=/my/path
This command builds the wrapper library using the GNU gcc compiler, decorates the name with the second underscore, and places the result, named libfftw3xf_gcc.a, into directory /my/path. The name of the resulting library is composed of the name of the compiler used and may be changed by an optional parameter.

Building an Application

Normally, the only change needed to build your application with FFTW3 wrappers replacing the original FFTW library is to add Intel MKL at the link stage (see section "Linking Your Application with Intel® Math Kernel Library" in the Intel MKL User's Guide).

If you recompile your application, add subdirectory include\fftw to the search path for header files to avoid FFTW3 version conflicts.

Sometimes, you may have to modify your application according to the following recommendations:
• The application requires #include "fftw3.h", which it probably already includes.
• The application does not require #include "mkl_dfti.h".
• The application does not require #include "fftw3_mkl.h". It is required only in case you want to use the MKL_RODFT00 constant.
• If the application does not check whether a NULL plan is returned by plan creation functions, this check must be added, because the FFTW3 to Intel MKL wrappers do not provide 100% of FFTW3 functionality.
• If the application is threaded, take care with shared plans, because the execute functions in the wrappers are not thread safe, unlike the original FFTW3 functions. See a note about setting fftw3_mkl.number_of_user_threads in section "Using FFTW3 Wrappers".

Running Examples

There are some examples that demonstrate how to use the wrapper library.
The source code for the examples, makefiles used to run them, and the example list files are located in the .\examples\fftw3xc and .\examples\fftw3xf subdirectories in the Intel MKL directory. To build Fortran examples, one additional file fftw3.f is needed. This file is distributed with permission from FFTW and is available in the .\include\fftw subdirectory of the Intel MKL directory. The original file can also be found in FFTW 3.2 at http://www.fftw.org/download.html.

Example makefile parameters are similar to the wrapper library makefile parameters. Example makefiles normally build and invoke the examples. If the parameter function= is defined, then only the specified example will run. Otherwise, all examples will be executed. Results of running the examples are saved in subdirectory .\_results in files with extension .res.

For detailed information about options for the example makefile, refer to the makefile.

MPI FFTW Wrappers

This section describes a collection of MPI FFTW wrappers to Intel® MKL. The wrappers correspond to the FFTW 3.3 Alpha release and the Intel MKL releases starting with 10.3. For a detailed description of the MPI FFTW interface, refer to www.fftw.org. MPI FFTW wrappers are available only with Intel MKL for the Linux* and Windows* operating systems.

These wrappers translate calls of MPI FFTW functions to the calls of the Intel MKL cluster Fourier transform (CFFT) functions. The purpose of the wrappers is to enable users of MPI FFTW functions to improve the performance of their applications without changing the program source code.

Although the MPI FFTW wrappers provide less functionality than the original FFTW 3.3 because of differences between MPI FFTW and Intel MKL CFFT, the wrappers cover many typical CFFT use cases.

The MPI FFTW wrappers are provided as source code. To use the wrappers, you need to build your own wrapper library (see Building Your Own Wrapper Library).
See Also
Cluster FFT Functions

Building Your Own Wrapper Library

The MPI FFTW wrappers for FFTW3 are delivered as source code, which can be compiled to build a wrapper library.

The source code for the wrappers, makefiles, and function list files are located in subdirectory .\interfaces\fftw3x_cdft in the Intel MKL directory.

To build the wrappers:
1. Change the current directory to the wrapper directory.
2. Run the make command on Linux* OS or the nmake command on Windows* OS with a required target and optionally several parameters.

The target, that is, one of {libia32, libintel64}, defines the platform architecture, and the other parameters specify the compiler, size of the default INTEGER type, as well as the name and placement of the resulting wrapper library. You can find a detailed and up-to-date description of the parameters in the makefile.

In the following example, the make command is used to build the MPI FFTW wrappers to Intel MKL for use from the GNU C compiler on Linux OS based on Intel® 64 architecture:
    cd interfaces/fftw3x_cdft
    make libintel64 compiler=gnu mpi=openmpi INSTALL_DIR=/my/path
This command builds the wrapper library using the GNU gcc compiler so that the final user executable can use Open MPI, and places the result, named libfftw3x_cdft_DOUBLE.a, into directory /my/path.

Building an Application

Normally, the only change needed to build your application with MPI FFTW wrappers replacing the original FFTW3 library is to add Intel MKL and the wrapper library at the link stage (see section "Linking Your Application with Intel® Math Kernel Library" in the Intel MKL User's Guide).

When you are recompiling your application, add subdirectory include\fftw to the search path for header files to avoid FFTW3 version conflicts.

Running Examples

There are some examples that demonstrate how to use the MPI FFTW wrapper library for FFTW3.
The source code for the examples, makefiles used to run them, and the example list files are located in the .\examples\fftw3x_cdft subdirectory in the Intel MKL directory. Example makefile parameters are similar to the wrapper library makefile parameters. Example makefiles normally build and invoke the examples. Results of running the examples are saved in subdirectory .\_results in files with extension .res.

For detailed information about options for the example makefile, refer to the makefile.

See Also
Building Your Own Wrapper Library

Glossary

$A^H$
    Denotes the conjugate transpose of a general matrix A. See also conjugate matrix.
$A^T$
    Denotes the transpose of a general matrix A. See also transpose.
band matrix
    A general m-by-n matrix A such that $a_{ij} = 0$ for $|i - j| > l$, where $1 < l < \min(m, n)$.
    For example, any tridiagonal matrix is a band matrix.
band storage
    A special storage scheme for band matrices. A matrix is stored in a two-dimensional array: columns of the matrix are stored in the corresponding columns of the array, and diagonals of the matrix are stored in rows of the array.
BLAS
    Abbreviation for Basic Linear Algebra Subprograms. These subprograms implement vector, matrix-vector, and matrix-matrix operations.
BRNG
    Abbreviation for Basic Random Number Generator. Basic random number generators are pseudorandom number generators imitating i.i.d. random number sequences of uniform distribution. Distributions other than uniform are generated by applying different transformation techniques to the sequences of random numbers of uniform distribution.
BRNG registration
    Standardized mechanism that allows a user to include a user-designed BRNG into the VSL and use it along with the predefined VSL basic generators.
Bunch-Kaufman factorization
    Representation of a real symmetric or complex Hermitian matrix A in the form $A = PUDU^HP^T$ (or $A = PLDL^HP^T$), where P is a permutation matrix, U and L are upper and lower triangular matrices with unit diagonal, and D is a Hermitian block-diagonal matrix with 1-by-1 and 2-by-2 diagonal blocks. U and L have 2-by-2 unit diagonal blocks corresponding to the 2-by-2 blocks of D.
c
    When found as the first letter of routine names, c indicates the usage of single-precision complex data type.
CBLAS
    C interface to the BLAS. See BLAS.
CDF
    Cumulative Distribution Function. The function that determines probability distribution for univariate or multivariate random variable X. For univariate distribution the cumulative distribution function is the function of real argument x, which for every x takes a value equal to probability of the event A: $X \le x$.
    For multivariate distribution the cumulative distribution function is the function of a real vector $x = (x_1, x_2, \ldots, x_n)$, which, for every x, takes a value equal to probability of the event $A = (X_1 \le x_1 \,\&\, X_2 \le x_2 \,\&\, \ldots \,\&\, X_n \le x_n)$.
Cholesky factorization
    Representation of a symmetric positive-definite or, for complex data, Hermitian positive-definite matrix A in the form $A = U^HU$ or $A = LL^H$, where L is a lower triangular matrix and U is an upper triangular matrix.
condition number
    The number $\kappa(A)$ defined for a given square matrix A as follows: $\kappa(A) = \|A\| \, \|A^{-1}\|$.
conjugate matrix
    The matrix $A^H$ defined for a given general matrix A as follows: $(A^H)_{ij} = (a_{ji})^*$.
conjugate number
    The conjugate of a complex number $z = a + bi$ is $z^* = a - bi$.
d
    When found as the first letter of routine names, d indicates the usage of double-precision real data type.
dot product
    The number denoted $x \cdot y$ and defined for given vectors x and y as follows: $x \cdot y = \sum_i x_i y_i$. Here $x_i$ and $y_i$ stand for the i-th elements of x and y, respectively.
double precision
    A floating-point data type. On Intel® processors, this data type allows you to store real numbers x such that $2.23 \times 10^{-308} < |x| < 1.79 \times 10^{308}$. For this data type, the machine precision $\varepsilon$ is approximately $10^{-15}$, which means that double-precision numbers usually contain no more than 15 significant decimal digits. For more information, refer to Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture.
eigenvalue
    See eigenvalue problem.
eigenvalue problem
    A problem of finding non-zero vectors x and numbers $\lambda$ (for a given square matrix A) such that $Ax = \lambda x$. Here the numbers $\lambda$ are called the eigenvalues of the matrix A and the vectors x are called the eigenvectors of the matrix A.
eigenvector
    See eigenvalue problem.
elementary reflector (Householder matrix)
    Matrix of a general form $H = I - \tau vv^T$, where v is a column vector and $\tau$ is a scalar.
    In LAPACK elementary reflectors are used, for example, to represent the matrix Q in the QR factorization (the matrix Q is represented as a product of elementary reflectors).
factorization
    Representation of a matrix as a product of matrices. See also Bunch-Kaufman factorization, Cholesky factorization, LU factorization, LQ factorization, QR factorization, Schur factorization.
FFTs
    Abbreviation for Fast Fourier Transforms. See Chapter 11 of this book.
full storage
    A storage scheme allowing you to store matrices of any kind. A matrix A is stored in a two-dimensional array a, with the matrix element $a_{ij}$ stored in the array element a(i,j).
Hermitian matrix
    A square matrix A that is equal to its conjugate matrix $A^H$. The conjugate $A^H$ is defined as follows: $(A^H)_{ij} = (a_{ji})^*$.
I
    See identity matrix.
identity matrix
    A square matrix I whose diagonal elements are 1, and off-diagonal elements are 0. For any matrix A, $AI = A$ and $IA = A$.
i.i.d.
    Independent Identically Distributed.
in-place
    Qualifier of an operation. A function that performs its operation in-place takes its input from an array and returns its output to the same array.
Intel MKL
    Abbreviation for Intel® Math Kernel Library.
inverse matrix
    The matrix denoted as $A^{-1}$ and defined for a given square matrix A as follows: $AA^{-1} = A^{-1}A = I$. $A^{-1}$ does not exist for singular matrices A.
LQ factorization
    Representation of an m-by-n matrix A as $A = LQ$ or $A = (L\ 0)Q$. Here Q is an n-by-n orthogonal (unitary) matrix. For $m \le n$, L is an m-by-m lower triangular matrix with real diagonal elements; for $m > n$, $L = \begin{pmatrix} L_1 \\ L_2 \end{pmatrix}$, where $L_1$ is an n-by-n lower triangular matrix, and $L_2$ is a rectangular matrix.
LU factorization
    Representation of a general m-by-n matrix A as $A = PLU$, where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n) and U is upper triangular (upper trapezoidal if m < n).
machine precision
    The number $\varepsilon$ determining the precision of the machine representation of real numbers. For Intel® architecture, the machine precision is approximately $10^{-7}$ for single-precision data, and approximately $10^{-15}$ for double-precision data. The precision also determines the number of significant decimal digits in the machine representation of real numbers. See also double precision and single precision.
MPI
    Message Passing Interface. This standard defines the user interface and functionality for a wide range of message-passing capabilities in parallel computing.
MPICH
    A freely available, portable implementation of the MPI standard for message-passing libraries.
orthogonal matrix
    A real square matrix A whose transpose and inverse are equal, that is, $A^T = A^{-1}$, and therefore $AA^T = A^TA = I$. All eigenvalues of an orthogonal matrix have the absolute value 1.
packed storage
    A storage scheme allowing you to store symmetric, Hermitian, or triangular matrices more compactly. The upper or lower triangle of a matrix is packed by columns in a one-dimensional array.
PDF
    Probability Density Function. The function that determines probability distribution for univariate or multivariate continuous random variable X. The probability density function f(x) is closely related with the cumulative distribution function F(x). For univariate distribution the relation is $F(x) = \int_{-\infty}^{x} f(t)\,dt$. For multivariate distribution the relation is $F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \cdots \int_{-\infty}^{x_n} f(t_1, t_2, \ldots, t_n)\,dt_n \cdots dt_2\,dt_1$.
positive-definite matrix
    A square matrix A such that $Ax \cdot x > 0$ for any non-zero vector x. Here $\cdot$ denotes the dot product.
pseudorandom number generator
    A completely deterministic algorithm that imitates truly random sequences.
QR factorization
    Representation of an m-by-n matrix A as $A = QR$, where Q is an m-by-m orthogonal (unitary) matrix, and R is an n-by-n upper triangular matrix with real diagonal elements (if $m \ge n$) or a trapezoidal matrix (if $m < n$).
random stream
    An abstract source of independent identically distributed random numbers of uniform distribution.
In this manual, a random stream points to a structure that uniquely defines a random number sequence generated by a basic generator associated with a given random stream.

RNG
Abbreviation for Random Number Generator. In this manual the term "random number generators" stands for pseudorandom number generators, that is, generators based on completely deterministic algorithms imitating truly random sequences.

Rectangular Full Packed (RFP) storage
A storage scheme combining the full and packed storage schemes for the upper or lower triangle of the matrix. This combination uses half the memory of full storage, as packed storage does, while maintaining efficiency by using Level 3 BLAS/LAPACK kernels, as full storage does.

s
When found as the first letter of routine names, s indicates the usage of the single-precision real data type.

ScaLAPACK
Stands for Scalable Linear Algebra PACKage.

Schur factorization
Representation of a square matrix A in the form $A = ZTZ^H$. Here T is an upper quasi-triangular matrix (for complex A, a triangular matrix) called the Schur form of A; the matrix Z is orthogonal (for complex A, unitary). Columns of Z are called Schur vectors.

single precision
A floating-point data type. On Intel® processors, this data type allows you to store real numbers x such that $1.18 \times 10^{-38} < |x| < 3.40 \times 10^{38}$. For this data type, the machine precision (ε) is approximately $10^{-7}$, which means that single-precision numbers usually contain no more than 7 significant decimal digits. For more information, refer to Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture.

singular matrix
A matrix whose determinant is zero. If A is a singular matrix, the inverse $A^{-1}$ does not exist, and the system of equations Ax = b does not have a unique solution (that is, there exist no solutions or an infinite number of solutions).

singular value
The singular values of a given general matrix A are the square roots of the eigenvalues of the matrix $AA^H$. See also SVD.

SMP
Abbreviation for Symmetric MultiProcessing. Intel MKL offers performance gains through parallelism provided by the SMP feature.

sparse BLAS
Routines performing basic vector operations on sparse vectors. Sparse BLAS routines take advantage of vectors' sparsity: they allow you to store only non-zero elements of vectors. See BLAS.

sparse vectors
Vectors in which most of the components are zeros.

storage scheme
The way of storing matrices. See full storage, packed storage, and band storage.

SVD
Abbreviation for Singular Value Decomposition. See also the Singular Value Decomposition section in Chapter 5.

symmetric matrix
A square matrix A such that $a_{ij} = a_{ji}$.

transpose
The transpose of a given matrix A is a matrix $A^T$ such that $(A^T)_{ij} = a_{ji}$ (rows of A become columns of $A^T$, and columns of A become rows of $A^T$).

trapezoidal matrix
A matrix A such that $A = (A_1\ A_2)$, where $A_1$ is an upper triangular matrix and $A_2$ is a rectangular matrix.

triangular matrix
A matrix A is called an upper (lower) triangular matrix if all its subdiagonal elements (superdiagonal elements) are zeros. Thus, for an upper triangular matrix $a_{ij} = 0$ when i > j; for a lower triangular matrix $a_{ij} = 0$ when i < j.

tridiagonal matrix
A matrix whose non-zero elements are in three diagonals only: the leading diagonal, the first subdiagonal, and the first superdiagonal.

unitary matrix
A complex square matrix A whose conjugate transpose and inverse are equal, that is, $A^H = A^{-1}$, and therefore $AA^H = A^HA = I$. All eigenvalues of a unitary matrix have the absolute value 1.

VML
Abbreviation for Vector Mathematical Library. See Chapter 9 of this book.

VSL
Abbreviation for Vector Statistical Library. See Chapter 10 of this book.

z
When found as the first letter of routine names, z indicates the usage of the double-precision complex data type.
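
The full storage and packed storage entries above can be illustrated with a short C sketch. It assumes column-major (Fortran-style) arrays, 1-based matrix indices, and the upper triangle packed by columns, so that element a(i,j) with i ≤ j lands at 1-based position i + j(j-1)/2 of the packed array. The helper names packed_upper_index, pack_upper, and sym_get are hypothetical and chosen for illustration only; they are not Intel MKL routines.

```c
#include <assert.h>
#include <stddef.h>

/* 0-based offset of element a(i,j), 1 <= i <= j, in the packed upper
   triangle. Columns are stored one after another, and column j
   contributes its first j elements. (Hypothetical helper.) */
size_t packed_upper_index(size_t i, size_t j)
{
    assert(i >= 1 && i <= j);
    return (i - 1) + j * (j - 1) / 2;
}

/* Copy the upper triangle of an n-by-n matrix held in full storage
   (column-major, leading dimension lda) into the packed array ap,
   which must have room for n*(n+1)/2 elements. */
void pack_upper(size_t n, const double *a, size_t lda, double *ap)
{
    for (size_t j = 1; j <= n; ++j)
        for (size_t i = 1; i <= j; ++i)
            ap[packed_upper_index(i, j)] = a[(i - 1) + (j - 1) * lda];
}

/* Read a(i,j) of a symmetric matrix from packed upper storage.
   Elements below the diagonal are obtained from a(j,i). */
double sym_get(const double *ap, size_t i, size_t j)
{
    return (i <= j) ? ap[packed_upper_index(i, j)]
                    : ap[packed_upper_index(j, i)];
}
```

For a symmetric n-by-n matrix this keeps n(n+1)/2 values instead of n*n, at the cost of computing an offset on each access. A caller would fill a column-major array, call pack_upper once, and then use sym_get for element access.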
H Intel® Math Kernel Library Reference Manual 2712 Index ?_backward_trig_transform 2450 ?_commit_Helmholtz_2D 2467 ?_commit_Helmholtz_3D 2467 ?_commit_sph_np 2476 ?_commit_sph_p 2476 ?_commit_trig_transform 2446 ?_forward_trig_transform 2448 ?_Helmholtz_2D 2470 ?_Helmholtz_3D 2470 ?_init_Helmholtz_2D 2465 ?_init_Helmholtz_3D 2465 ?_init_sph_np 2475 ?_init_sph_p 2475 ?_init_trig_transform 2445 ?_sph_np 2478 ?_sph_p 2478 ?asum 54 ?axpby 327 ?axpy 55 ?axpyi 141 ?bdsdc 756 ?bdsqr 752 ?cabs1 73 ?ConvExec 2239 ?ConvExec1D 2242 ?ConvExecX 2246 ?ConvExecX1D 2249 ?ConvNewTask 2220 ?ConvNewTask1D 2223 ?ConvNewTaskX 2225 ?copy 56 ?CorrExec 2239 ?CorrExec1D 2242 ?CorrExecX 2246 ?CorrExecX1D 2249 ?CorrNewTask 2220 ?CorrNewTask1D 2223 ?CorrNewTaskX 2225 ?CorrNewTaskX1D 2228 ?dbtf2 1872 ?dbtrf 1873 ?disna 818 ?dot 58 ?dotc 60 ?dotci 144 ?doti 143 ?dotu 61 ?dotui 145 ?dtsvb 595 ?dttrf 1874 ?dttrfb 363 ?dttrsb 392 ?dttrsv 1875 ?gamn2d 2552 ?gamx2d 2551 ?gbbrd 739 ?gbcon 422 ?gbequ 542 ?gbequb 545 ?gbmv 75 ?gbrfs 458 ?gbrfsx 461 ?gbsv 574 ?gbsvx 576 ?gbsvxx 582 ?gbtf2 1166 ?gbtrf 359 ?gbtrs 387 ?gebak 849 ?gebal 847 ?gebd2 1167 ?gebr2d 2561 ?gebrd 736 ?gebs2d 2560 ?gecon 420 ?geequ 538 ?geequb 540 ?gees 1020 ?geesx 1024 ?geev 1028 ?geevx 1032 ?gehd2 1168 ?gehrd 835 ?gejsv 1045 ?gelq2 1170 ?gelqf 689 ?gels 930 ?gelsd 939 ?gelss 937 ?gelsy 933 ?gem2vc 331 ?gem2vu 329 ?gemm 119 ?gemm3m 333 ?gemv 77 ?geql2 1171 ?geqlf 700 ?geqp3 678 ?geqpf 676 ?geqr2 1172 ?geqr2p 1174 ?geqrf 671 ?geqrfp 674 ?ger 79 ?gerc 81 ?gerfs 449 ?gerfsx 452 ?gerq2 1175 ?gerqf 710 ?geru 82 ?gerv2d 2557 ?gesc2 1176 ?gesd2d 2556 ?gesdd 1041 ?gesv 558 ?gesvd 1037 ?gesvj 1051 ?gesvx 561 ?gesvxx 567 ?getc2 1177 ?getf2 1178 ?getrf 357 ?getri 514 ?getrs 385 ?ggbak 883 ?ggbal 880 ?gges 1121 ?ggesx 1126 ?ggev 1132 ?ggevx 1136 Index 2713 ?ggglm 946 ?gghrd 878 ?gglse 943 ?ggqrf 728 ?ggrqf 731 ?ggsvd 1055 ?ggsvp 910 ?gsum2d 2553 ?gsvj0 1432 ?gsvj1 1434 ?gtcon 424 ?gthr 146 ?gthrz 147 ?gtrfs 467 ?gtsv 589 ?gtsvx 591 ?gttrf 361 
?gttrs 389 ?gtts2 1179 ?hbev 993 ?hbevd 998 ?hbevx 1004 ?hbgst 829 ?hbgv 1105 ?hbgvd 1110 ?hbgvx 1117 ?hbtrd 791 ?hecon 438 ?heequb 556 ?heev 951 ?heevd 956 ?heevr 970 ?heevx 963 ?heft2 1419 ?hegst 822 ?hegv 1068 ?hegvd 1074 ?hegvx 1081 ?hemm 122 ?hemv 86 ?her 87 ?her2 89 ?her2k 126 ?herdb 766 ?herfs 494 ?herfsx 496 ?herk 124 ?hesv 642 ?hesvx 645 ?hesvxx 649 ?heswapr 1413 ?hetrd 772 ?hetrf 378 ?hetri 522 ?hetri2 525 ?hetri2x 529 ?hetrs 404 ?hetrs2 408 ?hfrk 1438 ?hgeqz 885 ?hpcon 441 ?hpev 977 ?hpevd 981 ?hpevx 988 ?hpgst 825 ?hpgv 1087 ?hpgvd 1092 ?hpgvx 1099 ?hpmv 91 ?hpr 92 ?hpr2 94 ?hprfs 504 ?hpsv 661 ?hpsvx 663 ?hptrd 784 ?hptrf 383 ?hptri 532 ?hptrs 411 ?hsein 855 ?hseqr 851 ?isnan 1180 ?jacobi 2515 ?jacobi_delete 2514 ?jacobi_init 2512 ?jacobi_solve 2513 ?jacobix 2516 ?la_gbamv 1455 ?la_gbrcond 1457 ?la_gbrcond_c 1459 ?la_gbrcond_x 1460 ?la_gbrfsx_extended 1462 ?la_gbrpvgrw 1467 ?la_geamv 1468 ?la_gercond 1470 ?la_gercond_c 1471 ?la_gercond_x 1472 ?la_gerfsx_extended 1473 ?la_heamv 1478 ?la_hercond_c 1480 ?la_hercond_x 1481 ?la_herfsx_extended 1482 ?la_herpvgrw 1487 ?la_porcond 1489 ?la_porcond_c 1490 ?la_porcond_x 1492 ?la_porfsx_extended 1493 ?la_porpvgrw 1498 ?la_rpvgrw 1503 ?la_syamv 1431, 1505 ?la_syrcond 1507 ?la_syrcond_c 1508 ?la_syrcond_x 1509 ?la_syrfsx_extended 1511 ?la_syrpvgrw 1516 ?la_wwaddw 1517 ?labrd 1181 ?lacgv 1155 ?lacn2 1184 ?lacon 1185 ?lacp2 1455 ?lacpy 1186 ?lacrm 1156 ?lacrt 1156 ?ladiv 1187 ?lae2 1188 ?laebz 1189 ?laed0 1192 ?laed1 1194 ?laed2 1195 ?laed3 1197 ?laed4 1199 ?laed5 1200 ?laed6 1200 ?laed7 1202 ?laed8 1204 ?laed9 1207 ?laeda 1208 ?laein 1209 ?laesy 1157 ?laev2 1212 Intel® Math Kernel Library Reference Manual 2714 ?laexc 1213 ?lag2 1214 ?lags2 1216 ?lagtf 1218 ?lagtm 1220 ?lagts 1221 ?lagv2 1223 ?lahef 1378 ?lahqr 1224 ?lahr2 1228 ?lahrd 1226 ?laic1 1230 ?laisnan 1181 ?laln2 1232 ?lals0 1234 ?lalsa 1236 ?lalsd 1239 ?lamc1 1526 ?lamc2 1526 ?lamc3 1527 ?lamc4 1528 ?lamc5 1528 ?lamch 1525 ?lamrg 1241 ?lamsh 1866 ?laneg 
1242 ?langb 1243 ?lange 1244 ?langt 1245 ?lanhb 1248 ?lanhe 1253 ?lanhf 1443 ?lanhp 1250 ?lanhs 1246 ?lansb 1247 ?lansf 1442 ?lansp 1249 ?lanst/?lanht 1251 ?lansy 1252 ?lantb 1255 ?lantp 1256 ?lantr 1257 ?lanv2 1259 ?lapll 1259 ?lapmr 1260 ?lapmt 1262 ?lapy2 1262 ?lapy3 1263 ?laqgb 1264 ?laqge 1265 ?laqhb 1266 ?laqhe 1499 ?laqhp 1501 ?laqp2 1268 ?laqps 1269 ?laqr0 1270 ?laqr1 1273 ?laqr2 1274 ?laqr3 1277 ?laqr4 1280 ?laqr5 1282 ?laqsb 1285 ?laqsp 1286 ?laqsy 1287 ?laqtr 1289 ?lar1v 1290 ?lar2v 1293 ?larcm 1502 ?laref 1867 ?larf 1294 ?larfb 1295 ?larfg 1298 ?larfgp 1299 ?larfp 1429 ?larft 1300 ?larfx 1302 ?largv 1304 ?larnv 1305 ?larra 1306 ?larrb 1307 ?larrc 1309 ?larrd 1310 ?larre 1312 ?larrf 1315 ?larrj 1317 ?larrk 1318 ?larrr 1319 ?larrv 1320 ?larscl2 1504 ?lartg 1323 ?lartgp 1324 ?lartgs 1326 ?lartv 1327 ?laruv 1328 ?larz 1329 ?larzb 1330 ?larzt 1332 ?las2 1334 ?lascl 1335 ?lascl2 1504 ?lasd0 1336 ?lasd1 1338 ?lasd2 1340 ?lasd3 1342 ?lasd4 1344 ?lasd5 1346 ?lasd6 1347 ?lasd7 1350 ?lasd8 1353 ?lasd9 1354 ?lasda 1356 ?lasdq 1358 ?lasdt 1360 ?laset 1361 ?lasorte 1868 ?lasq1 1362 ?lasq2 1363 ?lasq3 1364 ?lasq4 1365 ?lasq5 1366 ?lasq6 1367 ?lasr 1368 ?lasrt 1371 ?lasrt2 1869 ?lassq 1372 ?lasv2 1373 ?laswp 1374 ?lasy2 1375 ?lasyf 1377 ?latbs 1380 ?latdf 1382 ?latps 1383 ?latrd 1385 ?latrs 1387 ?latrz 1390 ?lauu2 1392 ?lauum 1393 ?nrm2 62 ?opgtr 781 ?opmtr 782 Index 2715 ?orbdb/?unbdb 925 ?orcsd/?uncsd 1060 ?org2l/?ung2l 1394 ?org2r/?ung2r 1395 ?orgbr 742 ?orghr 837 ?orgl2/?ungl2 1396 ?orglq 692 ?orgql 702 ?orgqr 681 ?orgr2/?ungr2 1397 ?orgrq 712 ?orgtr 768 ?orm2l/?unm2l 1399 ?orm2r/?unm2r 1400 ?ormbr 744 ?ormhr 839 ?orml2/?unml2 1402 ?ormlq 694 ?ormql 706 ?ormqr 683 ?ormr2/?unmr2 1404 ?ormr3/?unmr3 1405 ?ormrq 716 ?ormrz 723 ?ormtr 770 ?pbcon 430 ?pbequ 552 ?pbrfs 480 ?pbstf 831 ?pbsv 617 ?pbsvx 619 ?pbtf2 1407 ?pbtrf 371 ?pbtrs 398 ?pftrf 368 ?pftri 517 ?pftrs 395 ?pocon 426 ?poequ 547 ?poequb 549 ?porfs 469 ?porfsx 472 ?posv 596 ?posvx 599 ?posvxx 604 ?potf2 1408 
?potrf 364 ?potri 516 ?potrs 393 ?ppcon 428 ?ppequ 550 ?pprfs 478 ?ppsv 611 ?ppsvx 612 ?pptrf 369 ?pptri 519 ?pptrs 396 ?pstf2 1451 ?pstrf 366 ?ptcon 432 ?pteqr 810 ?ptrfs 483 ?ptsv 623 ?ptsvx 625 ?pttrf 373 ?pttrs 400 ?pttrsv 1876 ?ptts2 1409 ?rot 63, 1158 ?rotg 64 ?roti 148 ?rotm 65 ?rotmg 67 ?rscl 1411 ?sbev 991 ?sbevd 995 ?sbevx 1001 ?sbgst 827 ?sbgv 1103 ?sbgvd 1107 ?sbgvx 1113 ?sbmv 95 ?sbtrd 789 ?scal 69 ?sctr 149 ?sdot 59 ?sfrk 1437 ?spcon 439 ?spev 975 ?spevd 979 ?spevx 985 ?spgst 823 ?spgv 1085 ?spgvd 1089 ?spgvx 1096 ?spmv 98, 1159 ?spr 99, 1161 ?spr2 101 ?sprfs 501 ?spsv 655 ?spsvx 657 ?sptrd 779 ?sptrf 381 ?sptri 530 ?sptrs 409 ?stebz 813 ?stedc 801 ?stegr 805 ?stein 815 ?stemr 798 ?steqr 795 ?steqr2 1878 ?sterf 793 ?stev 1008 ?stevd 1009 ?stevr 1015 ?stevx 1012 ?sum1 1165 ?swap 70 ?sycon 434 ?syconv 436 ?syequb 554 ?syev 949 ?syevd 954 ?syevr 966 ?syevx 959 ?sygs2/?hegs2 1415 ?sygst 820 ?sygv 1066 ?sygvd 1071 ?sygvx 1077 ?symm 128 ?symv 102, 1162 ?syr 104, 1163 ?syr2 106 ?syr2k 133 ?syrdb 764 ?syrfs 485 ?syrfsx 488 Intel® Math Kernel Library Reference Manual 2716 ?syrk 131 ?sysv 629 ?sysvx 631 ?sysvxx 635 ?syswapr 1411 ?syswapr1 1414 ?sytd2/?hetd2 1417 ?sytf2 1418 ?sytrd 762 ?sytrf 374 ?sytri 520 ?sytri2 523 ?sytri2x 527 ?sytrs 402 ?sytrs2 406 ?tbcon 447 ?tbmv 107 ?tbsv 109 ?tbtrs 418 ?tfsm 1440 ?tftri 535 ?tfttp 1444 ?tfttr 1445 ?tgevc 890 ?tgex2 1421 ?tgexc 894 ?tgsen 896 ?tgsja 914 ?tgsna 906 ?tgsy2 1423 ?tgsyl 902 ?tpcon 445 ?tpmv 112 ?tprfs 508 ?tpsv 113 ?tptri 536 ?tptrs 416 ?tpttf 1446 ?tpttr 1448 ?trbr2d 2562 ?trbs2d 2560 ?trcon 443 ?trevc 860 ?trexc 868 ?trmm 135 ?trmv 115 ?trnlsp_check 2499 ?trnlsp_delete 2503 ?trnlsp_get 2502 ?trnlsp_init 2497 ?trnlsp_solve 2500 ?trnlspbc_check 2506 ?trnlspbc_delete 2511 ?trnlspbc_get 2510 ?trnlspbc_init 2505 ?trnlspbc_solve 2508 ?trrfs 506 ?trrv2d 2558 ?trsd2d 2557 ?trsen 870 ?trsm 138 ?trsna 864 ?trsv 117 ?trsyl 874 ?trti2 1426 ?trtri (LAPACK) 534 ?trtrs (LAPACK) 413 ?trttf 1449 ?trttp 1450 ?tzrzf 720 
?ungbr 747 ?unghr 842 ?unglq 696 ?ungql 704 ?ungqr 685 ?ungrq 714 ?ungtr 775 ?unmbr 749 ?unmhr 844 ?unmlq 698 ?unmql 708 ?unmqr 687 ?unmrq 718 ?unmrz 725 ?unmtr 776 ?upgtr 786 ?upmtr 787 1-norm value complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex symmetric matrix 1252 general rectangular matrix 1244, 1779 general tridiagonal matrix 1245 Hermitian band matrix 1248 real symmetric matrix 1252, 1782 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1251 symmetric band matrix 1247 symmetric matrix packed storage 1249 trapezoidal matrix 1257 triangular band matrix 1255 triangular matrix packed storage 1256 upper Hessenberg matrix 1246, 1780 A absolute value of a vector element largest 71 smallest 72 accuracy modes, in VML 1969 adding magnitudes of elements of a distributed vector 2377 adding magnitudes of the vector elements 54 arguments matrix 2646 sparse vector 140 vector 2645 array descriptor 1535, 2373 auxiliary functions ?la_lin_berr 1488 auxiliary routines LAPACK ScaLAPACK 1739 B backward error 1488 balancing a matrix 847 band storage scheme 2646 basic quasi-number generator Niederreiter 2121 Sobol 2121 Index 2717 basic random number generators GFSR 2121 MCG, 32-bit 2121 MCG, 59-bit 2121 Mersenne Twister MT19937 2121 MT2203 2121 MRG 2121 Wichmann-Hill 2121 bdsdc 756 Bernoulli 2195 Beta 2186 bidiagonal matrix LAPACK 734 ScaLAPACK 1666 Binomial 2198 bisection 1307 BLACS broadcast 2559 combines 2550 destruction routines 2568 informational routines 2570 initialization routines 2562 miscellaneous routines 2571 point to point communication 2554 ?gamn2d 2552 ?gamx2d 2551 ?gebr2d 2561 ?gebs2d 2560 ?gerv2d 2557 ?gesd2d 2556 ?gsum2d 2553 ?trbr2d 2562 ?trbs2d 2560 ?trrv2d 2558 ?trsd2d 2557 blacs_abort 2569 blacs_barrier 2571 blacs_exit 2569 blacs_freebuff 2568 blacs_get 2564 blacs_gridexit 2569 blacs_gridinfo 2570 blacs_gridinit 2566 blacs_gridmap 2567 
blacs_pcoord 2571 blacs_pinfo 2563 blacs_pnum 2570 blacs_set 2565 blacs_setup 2563 usage examples 2572 BLACS routines matrix shapes 2549 blacs_abort 2569 blacs_barrier 2571 blacs_exit 2569 blacs_freebuff 2568 blacs_get 2564 blacs_gridexit 2569 blacs_gridinfo 2570 blacs_gridinit 2566 blacs_gridmap 2567 blacs_pcoord 2571 blacs_pinfo 2563 blacs_pnum 2570 blacs_set 2565 blacs_setup 2563 BLAS Code Examples 2653 BLAS Level 1 routines ?asum 53, 54 ?axpby 327 ?axpy 53, 55 ?cabs1 53, 73 ?copy 53, 56 ?dot 53, 58 ?dotc 53, 60 ?dotu 53, 61 ?nrm2 53, 62 ?rot 53, 63 ?rotg 53, 64 ?rotm 53, 65 ?rotmg 67 ?rotmq 53 ?scal 53, 69 ?sdot 53, 59 ?swap 53, 70 code example 2653 i?amax 53, 71 i?amin 53, 72 BLAS Level 2 routines ?gbmv 74, 75 ?gem2vc 331 ?gem2vu 329 ?gemv 74, 77 ?ger 74, 79 ?gerc 74, 81 ?geru 74, 82 ?hbmv 74, 84 ?hemv 74, 86 ?her 74, 87 ?her2 74, 89 ?hpmv 74, 91 ?hpr 74, 92 ?hpr2 74, 94 ?sbmv 74, 95 ?spmv 74, 98 ?spr 74, 99 ?spr2 74, 101 ?symv 74, 102 ?syr 74, 104 ?syr2 74, 106 ?tbmv 74, 107 ?tbsv 74, 109 ?tpmv 74, 112 ?tpsv 74, 113 ?trmv 74, 115 ?trsv 74, 117 code example 2654 BLAS Level 3 routines ?gemm 118, 119 ?gemm3m 333 ?hemm 118, 122 ?her2k 118, 126 ?herk 118, 124 ?symm 118, 128 ?syr2k 118, 133 ?syrk 118, 131 ?tfsm 1440 ?trmm 118, 135 ?trsm 118, 138 code example 2654 BLAS routines routine groups BLAS-like extensions 327 BLAS-like transposition routines mkl_?imatcopy 335 mkl_?omatadd 344 mkl_?omatcopy 338 mkl_?omatcopy2 341 block reflector Intel® Math Kernel Library Reference Manual 2718 general matrix LAPACK 1330 ScaLAPACK 1807 general rectangular matrix LAPACK 1295 ScaLAPACK 1795 triangular factor LAPACK 1300, 1332 ScaLAPACK 1802, 1813 block-cyclic distribution 1535, 2373 block-splitting method 2121 BRNG 2115, 2116, 2121 Bunch-Kaufman factorization Hermitian matrix packed storage 383 symmetric matrix packed storage 381 C C Datatypes 49 C interface conventions LAPACK 348 Cauchy 2173 cbbcsd 920 CBLAS arguments 2669 level 1 (vector operations) 2670 level 2 (matrix-vector 
operations) 2672 level 3 (matrix-matrix operations) 2676 sparse BLAS 2678 CBLAS to the BLAS 2669 cgbcon 422 cgbrfsx 461 cgbsvx 576 cgbtrs 387 cgecon 420 cgeqpf 676 cgtrfs 467 chegs2 1415 cheswapr 1413 chetd2 1417 chetri2 525 chetri2x 529 chetrs2 408 chgeqz 885 chla_transtype 1529 Cholesky factorization Hermitian positive semi-definite matrix 1451 Hermitian positive semidefinite matrix 366 Hermitian positive-definite matrix band storage 371, 398, 619, 1546, 1558 packed storage 369, 612 split 831 symmetric positive semi-definite matrix 1451 symmetric positive semidefinite matrix 366 symmetric positive-definite matrix band storage 371, 398, 619, 1546, 1558 packed storage 369, 612 chseqr 851 cla_gbamv 1455 cla_gbrcond_c 1459 cla_gbrcond_x 1460 cla_gbrfsx_extended 1462 cla_gbrpvgrw 1467 cla_geamv 1468 cla_gercond_c 1471 cla_gercond_x 1472 cla_gerfsx_extended 1473 cla_heamv 1478 cla_hercond_c 1480 cla_hercond_x 1481 cla_herfsx_extended 1482 cla_herpvgrw 1487 cla_lin_berr 1488 cla_porcond_c 1490 cla_porcond_x 1492 cla_porfsx_extended 1493 cla_porpvgrw 1498 cla_rpvgrw 1503 cla_syamv 1505 cla_syrcond_c 1508 cla_syrcond_x 1509 cla_syrfsx_extended 1511 cla_syrpvgrw 1516 cla_wwaddw 1517 clag2z 1427 clapmr 1260 clapmt 1262 clarfb 1295 clarft 1300 clarscl2 1504 clascl2 1504 clatps 1383 clatrd 1385 clatrs 1387 clatrz 1390 clauu2 1392 clauum 1393 code examples BLAS Level 1 function 2653 BLAS Level 1 routine 2653 BLAS Level 2 routine 2654 BLAS Level 3 routine 2654 communication subprograms complex division in real arithmetic 1187 complex Hermitian matrix 1-norm value LAPACK 1253 ScaLAPACK 1782 factorization with diagonal pivoting method 1419 Frobenius norm LAPACK 1253 ScaLAPACK 1782 infinity- norm LAPACK 1253 ScaLAPACK 1782 largest absolute value of element LAPACK 1253 ScaLAPACK 1782 complex Hermitian matrix in packed form 1-norm value 1250 Frobenius norm 1250 infinity- norm 1250 largest absolute value of element 1250 complex Hermitian tridiagonal matrix 1-norm value 1251 Frobenius 
norm 1251 infinity- norm 1251 largest absolute value of element 1251 complex matrix complex elementary reflector ScaLAPACK 1809 complex symmetric matrix 1-norm value 1252 Frobenius norm 1252 infinity- norm 1252 largest absolute value of element 1252 complex vector 1-norm using true absolute value Index 2719 LAPACK 1165 ScaLAPACK 1745 conjugation LAPACK 1155 ScaLAPACK 1743 complex vector conjugation LAPACK 1155 ScaLAPACK 1743 component-wise relative error 1488 compressed sparse vectors 140 computational node 2117 Computational Routines 669 condition number band matrix 422 general matrix LAPACK 420 ScaLAPACK 1564, 1566, 1568 Hermitian matrix packed storage 441 Hermitian positive-definite matrix band storage 430 packed storage 428 tridiagonal 432 symmetric matrix packed storage 439 symmetric positive-definite matrix band storage 430 packed storage 428 tridiagonal 432 triangular matrix band storage 447 packed storage 445 tridiagonal matrix 424 configuration parameters, in FFT interface 2313 Configuration Settings, for Fourier transform functions 2332 Continuous Distribution Generators 2153 Continuous Distributions 2156 ConvCopyTask 2254 ConvDeleteTask 2253 converting a DOUBLE COMPLEX triangular matrix to COMPLEX 1454 converting a double-precision triangular matrix to singleprecision 1453 converting a sparse vector into compressed storage form and writing zeros to the original vector 147 converting compressed sparse vectors into full storage form 149 ConvInternalPrecision 2234 Convolution and Correlation 2214 Convolution Functions ?ConvExec 2239 ?ConvExec1D 2242 ?ConvExecX 2246 ?ConvExecX1D 2249 ?ConvNewTask 2220 ?ConvNewTask1D 2223 ?ConvNewTaskX 2225 ?ConvNewTaskX1D 2228 ConvCopyTask 2254 ConvDeleteTask 2253 ConvSetDecimation 2237 ConvSetInternalPrecision 2234 ConvSetMode 2232 ConvSetStart 2235 CorrCopyTask 2254 CorrDeleteTask 2253 ConvSetMode 2232 ConvSetStart 2235 copying distributed vectors 2379 matrices distributed 1770 global parallel 1772 local replicated 1772 
two-dimensional LAPACK 1186, 1455 ScaLAPACK 1773 vectors 56 copying a matrix 1444–1446, 1448–1450 CopyStream 2138 CopyStreamState 2139 CorrCopyTask 2254 CorrDeleteTask 2253 Correlation Functions ?CorrExec 2239 ?CorrExec1D 2242 ?CorrExecX 2246 ?CorrExecX1D 2249 ?CorrNewTask 2220 ?CorrNewTask1D 2223 ?CorrNewTaskX 2225 ?CorrNewTaskX1D 2228 CorrSetDecimation 2237 CorrSetInternalPrecision 2234 CorrSetMode 2232 CorrSetStart 2235 CorrSetInternalDecimation 2237 CorrSetInternalPrecision 2234 CorrSetMode 2232 CorrSetStart 2235 cosine-sine decomposition LAPACK 919, 1060 cpbtf2 1407 cporfsx 472 cpotf2 1408 cpprfs 478 cpptrs 396 cptts2 1409 Cray 1879 crscl 1411 cs decomposition See also LAPACK routines, cs decomposition 919 CSD (cosine-sine decomposition) LAPACK 919, 1060 csyconv 436 csyswapr 1411 csyswapr1 1414 csytf2 1418 csytri2 523 csytri2x 527 csytrs2 406 ctgex2 1421 ctgsy2 1423 ctrexc 868 ctrti2 1426 cunbdb 925 cuncsd 1060 cung2l 1394 cung2r 1395 cungbr 747 cungl2 1396 cungr2 1397 cunm2l 1399 cunm2r 1400 cunml2 1402 cunmr2 1404 cunmr3 1405 Intel® Math Kernel Library Reference Manual 2720 D data type in VML 1969 shorthand 41 Data Types 2124 Datatypes, C language 49 dbbcsd 920 dbdsdc 756 dcg_check 1946 dcg_get 1948 dcg_init 1945 dcgmrhs_check 1949 dcgmrhs_get 1952 dcgmrhs_init 1948 DeleteStream 2137 descriptor configuration cluster FFT 2356 descriptor manipulation cluster FFT 2356 DF task dfdconstruct1d 2606 dfdConstruct1D 2606 dfdeditidxptr 2604 dfdEditIdxPtr 2604 dfdeditppspline1d 2595 dfdEditPPSpline1D 2595 dfdeditptr 2601 dfdEditPtr 2601 dfdeletetask 2627 dfDeleteTask 2627 dfdintegrate1d 2613 dfdIntegrate1D 2613 dfdintegrateex1d 2613 dfdIntegrateEx1D 2613 dfdintegrcallback 2623 dfdIntegrCallBack 2623 dfdinterpcallback 2621 dfdInterpCallBack 2621 dfdinterpolate1d 2607 dfdInterpolate1D 2607 dfdinterpolateex1d 2607 dfdInterpolateEx1D 2607 dfdnewtask1d 2592 dfdNewTask1D 2592 dfdsearchcells1d 2619 dfdSearchCells1D 2619 dfdsearchcellscallback 2625 dfdSearchCellsCallBack 2625 
dfdsearchcellsex1d 2619 dfdSearchCellsEx1D 2619 dfgmres_check 1953 dfgmres_get 1956 dfgmres_init 1952 dfieditptr 2601 dfiEditPtr 2601 dfieditval 2602 dfiEditVal 2602 dfsconstruct1d 2606 dfsConstruct1D 2606 dfseditidxptr 2604 dfsEditIdxPtr 2604 dfseditppspline1d 2595 dfsEditPPSpline1D 2595 dfseditptr 2601 dfsEditPtr 2601 dfsintegrate1d 2613 dfsIntegrate1D 2613 dfsintegrateex1d 2613 dfsIntegrateEx1D 2613 dfsintegrcallback 2623 dfsIntegrCallBack 2623 dfsinterpcallback 2621 dfsInterpCallBack 2621 dfsinterpolate1d 2607 dfsInterpolate1D 2607 dfsinterpolateex1d 2607 dfsInterpolateEx1D 2607 dfsnewtask1d 2592 dfsNewTask1D 2592 dfssearchcells1d 2619 dfsSearchCells1D 2619 dfssearchcellscallback 2625 dfsSearchCellsCallBack 2625 dfssearchcellsex1d 2619 dfsSearchCellsEx1D 2619 DFT routines descriptor configuration DftiSetValue 2325 DftiCommitDescriptor 2316 DftiCommitDescriptorDM 2358 DftiComputeBackward 2322 DftiComputeBackwardDM 2362 DftiComputeForward 2320 DftiComputeForwardDM 2360 DftiCopyDescriptor 2318 DftiCreateDescriptor 2314 DftiCreateDescriptorDM 2357 DftiErrorClass 2329 DftiErrorMessage 2331 DftiFreeDescriptor 2317 DftiFreeDescriptorDM 2359 DftiGetValue 2327 DftiGetValueDM 2367 DftiSetValue 2325 DftiSetValueDM 2365 dgbcon 422 dgbrfsx 461 dgbsvx 576 dgbtrs 387 dgecon 420 dgejsv 1045 dgeqpf 676 dgesvj 1051 dgtrfs 467 dhgeqz 885 dhseqr 851 diagonal elements LAPACK 1361 ScaLAPACK 1817 diagonal pivoting factorization Hermitian indefinite matrix 649 symmetric indefinite matrix 635 diagonally dominant tridiagonal matrix solving systems of linear equations 392 diagonally dominant-like banded matrix solving systems of linear equations 1553 diagonally dominant-like tridiagonal matrix solving systems of linear equations 1555 dimension 2645 Direct Sparse Solver (DSS) Interface Routines 1914 Discrete Distribution Generators 2153, 2154 Discrete Distributions 2189 Discrete Fourier Transform DftiSetValue 2325 distributed complex matrix transposition 2433, 2434 distributed general 
matrix matrix-vector product 2387, 2389 rank-1 update 2391 rank-1 update, unconjugated 2394 Index 2721 rank-l update, conjugated 2393 distributed Hermitian matrix matrix-vector product 2396, 2397 rank-1 update 2399 rank-2 update 2400 rank-k update 2422 distributed matrix equation AX = B 2437 distributed matrix-matrix operation rank-k update distributed Hermitian matrix 2422 transposition complex matrix 2433 complex matrix, conjugated 2434 real matrix 2432 distributed matrix-vector operation product Hermitian matrix 2396, 2397 symmetric matrix 2402, 2404 triangular matrix 2409, 2410 rank-1 update Hermitian matrix 2399 symmetric matrix 2406 rank-1 update, conjugated 2393 rank-1 update, unconjugated 2394 rank-2 update Hermitian matrix 2400 symmetric matrix 2407 distributed real matrix transposition 2432 distributed symmetric matrix matrix-vector product 2402, 2404 rank-1 update 2406 rank-2 update 2407 distributed triangular matrix matrix-vector product 2409, 2410 solving systems of linear equations 2413 distributed vector-scalar product 2384 distributed vectors adding magnitudes of vector elements 2377 copying 2379 dot product complex vectors 2382 complex vectors, conjugated 2381 real vectors 2380 Euclidean norm 2383 global index of maximum element 2376 linear combination of vectors 2378 sum of vectors 2378 swapping 2385 vector-scalar product 2384 distributed-memory computations Distribution Generators 2153 Distribution Generators Supporting Accurate Mode 2154 divide and conquer algorithm 1706, 1715 djacobi 2515 djacobi_delete 2514 djacobi_init 2512 djacobi_solve 2513 djacobix 2516 dla_gbamv 1455 dla_gbrcond 1457 dla_gbrfsx_extended 1462 dla_gbrpvgrw 1467 dla_geamv 1468 dla_gercond 1470 dla_gerfsx_extended 1473 dla_lin_berr 1488 dla_porcond 1489 dla_porfsx_extended 1493 dla_porpvgrw 1498 dla_rpvgrw 1503 dla_syamv 1505 dla_syrcond 1507 dla_syrfsx_extended 1511 dla_syrpvgrw 1516 dla_wwaddw 1517 dlag2s 1427 dlapmr 1260 dlapmt 1262 dlarfb 1295 dlarft 1300 dlarscl2 1504 
dlartgp 1324 dlartgs 1326 dlascl2 1504 dlat2s 1453 dlatps 1383 dlatrd 1385 dlatrs 1387 dlatrz 1390 dlauu2 1392 dlauum 1393 dNewAbstractStream 2133 dorbdb 925 dorcsd 1060 dorg2l 1394 dorg2r 1395 dorgl2 1396 dorgr2 1397 dorm2l 1399 dorm2r 1400 dorml2 1402 dormr2 1404 dormr3 1405 dot product complex vectors, conjugated 60 complex vectors, unconjugated 61 distributed complex vectors, conjugated 2381 distributed complex vectors, unconjugated 2382 distributed real vectors 2380 real vectors 58 real vectors (extended precision) 59 sparse complex vectors 145 sparse complex vectors, conjugated 144 sparse real vectors 143 dpbtf2 1407 dporfsx 472 dpotf2 1408 dpprfs 478 dpptrs 396 dptts2 1409 driver expert 1536 simple 1536 Driver Routines 557, 930 drscl 1411 dss_create 1916 dsyconv 436 dsygs2 1415 dsyswapr 1411 dsyswapr1 1414 dsytd2 1417 dsytf2 1418 dsytri2 523 dsytri2x 527 dsytrs2 406 dtgex2 1421 dtgsy2 1423 dtrexc 868 Intel® Math Kernel Library Reference Manual 2722 dtrnlsp_check 2499 dtrnlsp_delete 2503 dtrnlsp_get 2502 dtrnlsp_init 2497 dtrnlsp_solve 2500 dtrnlspbc_check 2506 dtrnlspbc_delete 2511 dtrnlspbc_get 2510 dtrnlspbc_init 2505 dtrnlspbc_solve 2508 dtrti2 1426 dzsum1 1165 E eigenpairs, sorting 1868 eigenvalue problems general matrix 833, 877, 1656 generalized form 819 Hermitian matrix 758 symmetric matrix 758 symmetric tridiagonal matrix 1870, 1878 eigenvalues eigenvalue problems 758 eigenvectors eigenvalue problems 758 elementary reflector complex matrix 1809 general matrix 1329, 1804 general rectangular matrix LAPACK 1294, 1302 ScaLAPACK 1793, 1798 LAPACK generation 1298, 1299 ScaLAPACK generation 1800 error diagnostics, in VML 1973 error estimation for linear equations distributed tridiagonal coefficient matrix 1576 error handling pxerbla 1882, 2530 xerbla 1973 errors in solutions of linear equations banded matrix 461, 1462, 1493 distributed tridiagonal coefficient matrix 1576 general matrix band storage 458 Hermitian indefinite matrix 496, 1482 Hermitian matrix 
packed storage 504 Hermitian positive-definite matrix band storage 480 packed storage 478 symmetric indefinite matrix 488, 1511 symmetric matrix packed storage 501 symmetric positive-definite matrix band storage 480 packed storage 478 triangular matrix band storage 511 packed storage 508 tridiagonal matrix 467 Estimates 2606 Euclidean norm of a distributed vector 2383 of a vector 62 expert driver 1536 Exponential 2165 F factorization Bunch-Kaufman LAPACK 357 ScaLAPACK 1538 Cholesky LAPACK 357, 1407, 1408 ScaLAPACK 1857 diagonal pivoting Hermitian matrix complex 1419 packed 663 symmetric matrix indefinite 1418 packed 657 LU LAPACK 357 ScaLAPACK 1538 orthogonal LAPACK 670 ScaLAPACK 1586 partial complex Hermitian indefinite matrix 1378 real/complex symmetric matrix 1377 triangular factorization 357, 1538 upper trapezoidal matrix 1390 fast Fourier transform DftiCommitDescriptor 2316 DftiCommitDescriptorDM 2358 DftiComputeBackward 2322 DftiComputeBackwardDM 2362 DftiComputeForwardDM 2360 DftiCopyDescriptor 2318 DftiCreateDescriptor 2314 DftiCreateDescriptorDM 2357 DftiErrorClass 2329 DftiErrorMessage 2331 DftiFreeDescriptor 2317 DftiFreeDescriptorDM 2359 DftiGetValue 2327 DftiGetValueDM 2367 DftiSetValueDM 2365 fast Fourier Transform DftiComputeForward 2320 FFT computation cluster FFT 2356 FFT functions descriptor manipulation DftiCommitDescriptor 2316 DftiCommitDescriptorDM 2358 DftiCopyDescriptor 2318 DftiCreateDescriptor 2314 DftiCreateDescriptorDM 2357 DftiFreeDescriptor 2317 DftiFreeDescriptorDM 2359 DFT computation DftiComputeBackward 2322 DftiComputeForward 2320 FFT computation DftiComputeForwardDM 2360 status checking DftiErrorClass 2329 DftiErrorMessage 2331 FFT Interface 2313 FFT routines descriptor configuration DftiGetValue 2327 DftiGetValueDM 2367 DftiSetValueDM 2365 Index 2723 FFT computation DftiComputeBackwardDM 2362 FFTW interface to Intel(R) MKL for FFTW2 2689 for FFTW3 2697 fill-in, for sparse matrices 2631 finding index of the element of a vector 
with the largest absolute value of the real part 1744 element of a vector with the largest absolute value 71 element of a vector with the largest absolute value of the real part and its global index 1745 element of a vector with the smallest absolute value 72 font conventions 41 Fortran 95 interface conventions BLAS, Sparse BLAS 52 LAPACK 351 Fortran 95 Interfaces for LAPACK absent from Netlib 2684 identical to Netlib 2681 modified Netlib interfaces 2684 new functionality 2687 with replaced Netlib argument names 2682 Fortran 95 Interfaces for LAPACK Routines specific MKL features Fortran 95 LAPACK interface vs. Netlib 352 free_Helmholtz_2D 2474 free_Helmholtz_3D 2474 free_sph_np 2480 free_sph_p 2480 free_trig_transform 2451 Frobenius norm complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex symmetric matrix 1252 general rectangular matrix 1244, 1779 general tridiagonal matrix 1245 Hermitian band matrix 1248 real symmetric matrix 1252, 1782 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1251 symmetric band matrix 1247 symmetric matrix packed storage 1249 trapezoidal matrix 1257 triangular band matrix 1255 triangular matrix packed storage 1256 upper Hessenberg matrix 1246, 1780 full storage scheme 2646 full-storage vectors 140 function name conventions, in VML 1970 G Gamma 2183 gathering sparse vector's elements into compressed form and writing zeros to these elements 147 Gaussian 2159 GaussianMV 2161 gbcon 422 gbsvx 576 gbtrs 387 gecon 420 general distributed matrix scalar-matrix-matrix product 2418 general matrix block reflector 1330, 1807 eigenvalue problems 833, 877, 1656 elementary reflector 1329, 1804 estimating the condition number band storage 422 inverting matrix LAPACK 514 ScaLAPACK 1578 LQ factorization 689, 1598 LU factorization band storage 359, 1166, 1540, 1542, 1872, 1873 matrix-vector product band storage 75 multiplying by orthogonal 
matrix from LQ factorization 1402, 1846 from QR factorization 1400, 1843 from RQ factorization 1404, 1849 from RZ factorization 1405 multiplying by unitary matrix from LQ factorization 1402, 1846 from QR factorization 1400, 1843 from RQ factorization 1404, 1849 from RZ factorization 1405 QL factorization LAPACK 700 ScaLAPACK 1608 QR factorization with pivoting 676, 678, 1589 rank-1 update 79 rank-1 update, conjugated 81 rank-1 update, unconjugated 82 reduction to bidiagonal form 1167, 1181, 1751 reduction to upper Hessenberg form 1754 RQ factorization LAPACK 710 ScaLAPACK 1636 scalar-matrix-matrix product 119, 333 solving systems of linear equations band storage LAPACK 387 ScaLAPACK 1551 general rectangular distributed matrix computing scaling factors 1583 equilibration 1583 general rectangular matrix 1-norm value LAPACK 1244 ScaLAPACK 1779 block reflector LAPACK 1295 ScaLAPACK 1795 elementary reflector LAPACK 1294, 1798 ScaLAPACK 1793 Frobenius norm LAPACK 1244 ScaLAPACK 1779 infinity- norm LAPACK 1244 ScaLAPACK 1779 largest absolute value of element LAPACK 1244 ScaLAPACK 1779 LQ factorization LAPACK 1170 ScaLAPACK 1756 multiplication LAPACK 1335 Intel® Math Kernel Library Reference Manual 2724 ScaLAPACK 1815 QL factorization LAPACK 1171 ScaLAPACK 1758 QR factorization LAPACK 1172, 1174 ScaLAPACK 1760 reduction of first columns LAPACK 1226, 1228 ScaLAPACK 1775 reduction to bidiagonal form 1765 row interchanges LAPACK 1374 ScaLAPACK 1821 RQ factorization LAPACK 1175 ScaLAPACK 1617, 1762 scaling 1787 general square matrix reduction to upper Hessenberg form 1168 trace 1822 general triangular matrix LU factorization band storage 1746 general tridiagonal matrix 1-norm value 1245 Frobenius norm 1245 infinity- norm 1245 largest absolute value of element 1245 general tridiagonal triangular matrix LU factorization band storage 1748 generalized eigenvalue problems complex Hermitian-definite problem band storage 829 packed storage 825 real symmetric-definite problem band 
storage 827 packed storage 823 See also LAPACK routines, generalized eigenvalue problems 819 Generalized LLS Problems 943 Generalized Nonsymmetric Eigenproblems 1120 generalized Schur factorization 1223, 1293, 1304, 1305 Generalized Singular Value Decomposition 910 generalized Sylvester equation 902 Generalized Symmetric Definite Eigenproblems 1065 generation methods 2116 Geometric 2196 geqpf 676 GetBrngProperties 2210 getcpuclocks 2533 getcpufrequency 2534 GetNumRegBrngs 2152 GetStreamSize 2145 GetStreamStateBrng 2151 GFSR 2118 Givens rotation modified Givens transformation parameters 67 of sparse vectors 148 parameters 64 global array 1535, 2373 global index of maximum element of a distributed vector 2376 gtrfs 467 Gumbel 2181 H Helmholtz problem three-dimensional 2461 two-dimensional 2458 Helmholtz problem on a sphere non-periodic 2459 periodic 2459 Hermitian band matrix 1-norm value 1248 Frobenius norm 1248 infinity-norm 1248 largest absolute value of element 1248 Hermitian distributed matrix rank-n update 2424 scalar-matrix-matrix product 2420 Hermitian indefinite matrix matrix-vector product 1478 Hermitian matrix Bunch-Kaufman factorization packed storage 383 eigenvalues and eigenvectors 1713, 1715, 1717 estimating the condition number packed storage 441 generalized eigenvalue problems 819 inverting the matrix packed storage 532 matrix-vector product band storage 84 packed storage 91 rank-1 update packed storage 92 rank-2 update packed storage 94 rank-2k update 126 rank-k update 124 reducing to standard form LAPACK 1415 ScaLAPACK 1859 reducing to tridiagonal form LAPACK 1385, 1417 ScaLAPACK 1823, 1861 scalar-matrix-matrix product 122 scaling 1789 solving systems of linear equations packed storage 411 Hermitian positive definite distributed matrix computing scaling factors 1584 equilibration 1584 Hermitian positive semidefinite matrix Cholesky factorization 366 Hermitian positive-definite band matrix Cholesky factorization 1407 Hermitian positive-definite
distributed matrix inverting the matrix 1580 Hermitian positive-definite matrix Cholesky factorization band storage 371, 1546 packed storage 369 estimating the condition number band storage 430 packed storage 428 inverting the matrix packed storage 519 solving systems of linear equations band storage 398, 1558 packed storage 396 Hermitian positive-definite tridiagonal matrix solving systems of linear equations 1560 heswapr 1413 hetri2 525 hetri2x 529 hgeqz 885 Householder matrix LAPACK 1298, 1299 ScaLAPACK 1800 Householder reflector 1867 hseqr 851 Hypergeometric 2200 I i?amax 71 i?amin 72 i?max1 1164 IBM ESSL library 2214 IEEE arithmetic 1778 IEEE standard implementation 1880 signbit position 1882 ila?lr 1432 iladiag 1530 ilaenv 1520 ilaprec 1531 ilatrans 1531 ilauplo 1532 ilaver 1519 ILU0 preconditioner 1958 Incomplete LU Factorization Technique 1958 increment 2645 iNewAbstractStream 2131 infinity-norm complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex symmetric matrix 1252 general rectangular matrix 1244, 1779 general tridiagonal matrix 1245 Hermitian band matrix 1248 real symmetric matrix 1252, 1782 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1251 symmetric band matrix 1247 symmetric matrix packed storage 1249 trapezoidal matrix 1257 triangular band matrix 1255 triangular matrix packed storage 1256 upper Hessenberg matrix 1246, 1780 Interface Consideration 153 inverse matrix.
inverting a matrix 514, 1578, 1580, 1581 inverting a matrix general matrix LAPACK 514 ScaLAPACK 1578 Hermitian matrix packed storage 532 Hermitian positive-definite matrix LAPACK 516 packed storage 519 ScaLAPACK 1580 symmetric matrix packed storage 530 symmetric positive-definite matrix LAPACK 516 packed storage 519 ScaLAPACK 1580 triangular distributed matrix 1581 triangular matrix packed storage 536 iparmq 1522 Iterative Sparse Solvers 1932 Iterative Sparse Solvers based on Reverse Communication Interface (RCI ISS) 1932 J Jacobi plane rotations 1051 Jacobian matrix calculation routines ?jacobi 2515 ?jacobi_delete 2514 ?jacobi_init 2512 ?jacobi_solve 2513 ?jacobix 2516 L la_gbamv 1455 la_gbrcond 1457 la_gbrcond_c 1459 la_gbrcond_x 1460 la_gercond 1470 la_gercond_c 1471 la_gercond_x 1472 la_hercond_c 1480 la_hercond_x 1481 la_lin_berr 1488 la_porcond 1489 la_porcond_c 1490 la_porcond_x 1492 la_syrcond 1507 la_syrcond_c 1508 la_syrcond_x 1509 LAPACK naming conventions 347 LAPACK auxiliary routines ?la_geamv 1468 ?la_heamv 1478 ?la_syamv 1505 ?larscl2 1504 ?lascl2 1504 LAPACK routines ?gsvj0 1432 ?gsvj1 1434 ?hfrk 1438 ?larfp 1429 ?sfrk 1437 2-by-2 generalized eigenvalue problem 1214 2-by-2 Hermitian matrix plane rotation 1293 2-by-2 orthogonal matrices 1216 2-by-2 real matrix generalized Schur factorization 1223 2-by-2 real nonsymmetric matrix Schur factorization 1259 2-by-2 symmetric matrix plane rotation 1293 2-by-2 triangular matrix singular values 1334 SVD 1373 approximation to smallest eigenvalue 1365 auxiliary routines ?gbtf2 1166 ?gebd2 1167 ?gehd2 1168 ?gelq2 1170 ?geql2 1171 ?geqr2 1172 ?geqr2p 1174 ?gerq2 1175 ?gesc2 1176 ?getc2 1177 ?getf2 1178 ?gtts2 1179 ?hetf2 1419 ?hfrk 1438 ?isnan 1180 ?la_gbrpvgrw 1467 ?la_herpvgrw 1487 ?la_porpvgrw 1498 ?la_rpvgrw 1503 ?la_syrpvgrw 1516 ?la_wwaddw 1517 ?labrd 1181 ?lacgv 1155 ?lacn2 1184 ?lacon 1185 ?lacp2 1455 ?lacpy 1186 ?lacrm 1156 ?lacrt 1156 ?ladiv 1187 ?lae2
1188 ?laebz 1189 ?laed0 1192 ?laed1 1194 ?laed2 1195 ?laed3 1197 ?laed4 1199 ?laed5 1200 ?laed6 1200 ?laed7 1202 ?laed8 1204 ?laed9 1207 ?laeda 1208 ?laein 1209 ?laesy 1157 ?laev2 1212 ?laexc 1213 ?lag2 1214 ?lags2 1216 ?lagtf 1218 ?lagtm 1220 ?lagts 1221 ?lagv2 1223 ?lahef 1378 ?lahqr 1224 ?lahr2 1228 ?lahrd 1226 ?laic1 1230 ?laisnan 1181 ?laln2 1232 ?lals0 1234 ?lalsa 1236 ?lalsd 1239 ?lamrg 1241 ?laneg 1242 ?langb 1243 ?lange 1244 ?langt 1245 ?lanhb 1248 ?lanhe 1253 ?lanhf 1443 ?lanhp 1250 ?lanhs 1246 ?lansb 1247 ?lansf 1442 ?lansp 1249 ?lanst/?lanht 1251 ?lansy 1252 ?lantb 1255 ?lantp 1256 ?lantr 1257 ?lanv2 1259 ?lapll 1259 ?lapmr 1260 ?lapmt 1262 ?lapy2 1262 ?lapy3 1263 ?laqgb 1264 ?laqge 1265 ?laqhb 1266 ?laqhe 1499 ?laqhp 1501 ?laqp2 1268 ?laqps 1269 ?laqr0 1270 ?laqr1 1273 ?laqr2 1274 ?laqr3 1277 ?laqr4 1280 ?laqr5 1282 ?laqsb 1285 ?laqsp 1286 ?laqsy 1287 ?laqtr 1289 ?lar1v 1290 ?lar2v 1293 ?larcm 1502 ?larf 1294 ?larfb 1295 ?larfg 1298 ?larfgp 1299 ?larfp 1429 ?larft 1300 ?larfx 1302 ?largv 1304 ?larnv 1305 ?larra 1306 ?larrb 1307 ?larrc 1309 ?larrd 1310 ?larre 1312 ?larrf 1315 ?larrj 1317 ?larrk 1318 ?larrr 1319 ?larrv 1320 ?lartg 1323 ?lartgp 1324 ?lartgs 1326 ?lartv 1327 ?laruv 1328 ?larz 1329 ?larzb 1330 ?larzt 1332 ?las2 1334 ?lascl 1335 ?lasd0 1336 ?lasd1 1338 ?lasd2 1340 ?lasd3 1342 ?lasd4 1344 ?lasd5 1346 ?lasd6 1347 ?lasd7 1350 ?lasd8 1353 ?lasd9 1354 ?lasda 1356 ?lasdq 1358 ?lasdt 1360 ?laset 1361 ?lasq1 1362 Index 2727 ?lasq2 1363 ?lasq3 1364 ?lasq4 1365 ?lasq5 1366 ?lasq6 1367 ?lasr 1368 ?lasrt 1371 ?lassq 1372 ?lasv2 1373 ?laswp 1374 ?lasy2 1375 ?lasyf 1377 ?latbs 1380 ?latdf 1382 ?latps 1383 ?latrd 1385 ?latrs 1387 ?latrz 1390 ?lauu2 1392 ?lauum 1393 ?orbdb/?unbdb 925 ?orcsd/?uncsd 1060 ?org2l/?ung2l 1394 ?org2r/?ung2r 1395 ?orgl2l/?ungl2 1396 ?orgr2/?ungr2 1397 ?orm2l/?unm2l 1399 ?orm2r/?unm2r 1400 ?orml2/?unml2 1402 ?ormr2/?unmr2 1404 ?ormr3/?unmr3 1405 ?pbtf2 1407 ?potf2 1408 ?pstf2 1451 ?ptts2 1409 ?rot 1158 ?rscl 1411 ?sfrk 1437 ?spmv 
1159 ?spr 1161 ?sum1 1165 ?sygs2/?hegs2 1415 ?symv 1162 ?syr 1163 ?sytd2/?hetd2 1417 ?sytf2 1418 ?tfttp 1444 ?tfttr 1445 ?tgex2 1421 ?tgsy2 1423 ?tpttf 1446 ?tpttr 1448 ?trti2 1426 ?trttf 1449 ?trttp 1450 clag2z 1427 dlag2s 1427 dlat2s 1453 i?max1 1164 ila?lc 1431 ila?lr 1432 slag2d 1428 zlag2c 1429 zlat2c 1454 banded matrix equilibration ?gbequ 542 ?gbequb 545 bidiagonal divide and conquer 1360 block reflector triangular factor 1300, 1332 checking for safe infinity 1523 checking for strings equality 1524 complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex matrix multiplication 1156, 1502 complex symmetric matrix computing eigenvalues and eigenvectors 1157 matrix-vector product 1162 symmetric rank-1 update 1163 complex symmetric packed matrix symmetric rank-1 update 1161 complex vector 1-norm using true absolute value 1165 index of element with max absolute value 1164 linear transformation 1156 matrix-vector product 1159 plane rotation 1158 complex vector conjugation 1155 condition number estimation ?disna 818 ?gbcon 422 ?gecon 420 ?gtcon 424 ?hecon 438 ?hpcon 441 ?pbcon 430 ?pocon 426 ?ppcon 428 ?ptcon 432 ?spcon 439 ?sycon 434 ?tbcon 447 ?tpcon 445 ?trcon 443 determining machine parameters 1526 diagonally dominant triangular factorization ?dttrfb 363 dqd transform 1367 dqds transform 1366 driver routines generalized LLS problems ?ggglm 946 ?gglse 943 generalized nonsymmetric eigenproblems ?gges 1121 ?ggesx 1126 ?ggev 1132 ?ggevx 1136 Intel® Math Kernel Library Reference Manual 2728 generalized symmetric definite eigenproblems ?hbgv 1105 ?hbgvd 1110 ?hbgvx 1117 ?hegv 1068 ?hegvd 1074 ?hegvx 1081 ?hpgv 1087 ?hpgvd 1092 ?hpgvx 1099 ?sbgv 1103 ?sbgvd 1107 ?sbgvx 1113 ?spgv 1085 ?spgvd 1089 ?spgvx 1096 ?sygv 1066 ?sygvd 1071 ?sygvx 1077 linear least squares problems ?gels 930 ?gelsd 939 ?gelss 937 ?gelsy 933 ?lals0 (auxiliary) 1234 ?lalsa (auxiliary) 1236 ?lalsd (auxiliary) 1239 
nonsymmetric eigenproblems ?gees 1020 ?geesx 1024 ?geev 1028 ?geevx 1032 singular value decomposition ?gejsv 1045 ?gelsd 939 ?gesdd 1041 ?gesvd 1037 ?gesvj 1051 ?ggsvd 1055 solving linear equations ?dtsvb 595 ?gbsv 574 ?gbsvx 576 ?gbsvxx 582 ?gesv 558 ?gesvx 561 ?gesvxx 567 ?gtsv 589 ?gtsvx 591 ?hesv 642 ?hesvx 645 ?hesvxx 649 ?hpsv 661 ?hpsvx 663 ?pbsv 617 ?pbsvx 619 ?posv 596 ?posvx 599 ?posvxx 604 ?ppsv 611 ?ppsvx 612 ?ptsv 623 ?ptsvx 625 ?spsv 655 ?spsvx 657 ?sysv 629 ?sysvx 631 ?sysvxx 635 symmetric eigenproblems ?hbev 993 ?hbevd 998 ?hbevx 1004 ?heev 951 ?heevd 956 ?heevr 970 ?heevx 963 ?hpev 977 ?hpevd 981 ?hpevx 988 ?sbev 991 ?sbevd 995 ?sbevx 1001 ?spev 975 ?spevd 979 ?spevx 985 ?stev 1008 ?stevd 1009 ?stevr 1015 ?stevx 1012 ?syev 949 ?syevd 954 ?syevr 966 ?syevx 959 environmental enquiry 1520, 1522 finding a relatively isolated eigenvalue 1315 general band matrix equilibration 1264 general matrix block reflector 1330 elementary reflector 1329 reduction to bidiagonal form 1167, 1181 general matrix equilibration ?geequ 538 ?geequb 540 general rectangular matrix block reflector 1295 elementary reflector 1294, 1302 equilibration 1265, 1499, 1501 LQ factorization 1170 plane rotation 1368 QL factorization 1171 QR factorization 1172, 1174 row interchanges 1374 RQ factorization 1175 general square matrix reduction to upper Hessenberg form 1168 general tridiagonal matrix 1218, 1220, 1221, 1245, 1312, 1320 generalized eigenvalue problems ?hbgst 829 ?hegst 822 ?hpgst 825 ?pbstf 831 ?sbgst 827 ?spgst 823 ?sygst 820 generalized SVD ?ggsvp 910 ?tgsja 914 generalized Sylvester equation ?tgsyl 902 Hermitian band matrix equilibration 1266, 1287 Index 2729 Hermitian band matrix in packed storage equilibration 1286 Hermitian indefinite matrix equilibration ?heequb 556 Hermitian matrix computing eigenvalues and eigenvectors 1212 Hermitian positive-definite matrix equilibration ?poequ 547 ?poequb 549 Householder matrix elementary reflector 1298, 1299 ila?lc 1431 ila?lr 1432 
incremental condition estimation 1230 linear dependence of vectors 1259 LQ factorization ?gelq2 1170 ?gelqf 689 ?orglq 692 ?ormlq 694 ?unglq 696 ?unmlq 698 LU factorization general band matrix 1166 matrix equilibration ?laqgb 1264 ?laqge 1265 ?laqhb 1266 ?laqhe 1499 ?laqhp 1501 ?laqsb 1285 ?laqsp 1286 ?laqsy 1287 ?pbequ 552 ?ppequ 550 matrix inversion ?getri 514 ?hetri 522 ?hetri2 525 ?hetri2x 529 ?hptri 532 ?potri 516 ?pptri 519 ?sptri 530 ?sytri 520 ?sytri2 523 ?sytri2x 527 ?tptri 536 ?trtri 534 matrix-matrix product ?lagtm 1220 merging sets of singular values 1340, 1350 mixed precision iterative refinement subroutines 558, 596, 1427–1429 nonsymmetric eigenvalue problems ?gebak 849 ?gebal 847 ?gehrd 835 ?hsein 855 ?hseqr 851 ?orghr 837 ?ormhr 839 ?trevc 860 ?trexc 868 ?trsen 870 ?trsna 864 ?unghr 842 ?unmhr 844 off-diagonal and diagonal elements 1361 permutation list creation 1241 permutation of matrix columns 1262 permutation of matrix rows 1260 plane rotation 1323, 1324, 1326, 1327, 1368 plane rotation vector 1304 QL factorization ?geql2 1171 ?geqlf 700 ?orgql 702 ?ormql 706 ?ungql 704 ?unmql 708 QR factorization ?geqp3 678 ?geqpf 676 ?geqr2 1172 ?geqr2p 1174 ?geqrf 671 ?geqrfp 674 ?ggqrf 728 ?ggrqf 731 ?laqp2 1268 ?laqps 1269 ?orgqr 681 ?ormqr 683 ?ungqr 685 ?unmqr 687 p?geqrf 1587 random numbers vector 1305 real lower bidiagonal matrix SVD 1358 real square bidiagonal matrix singular values 1362 real symmetric matrix 1252 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1189, 1251 real upper bidiagonal matrix singular values 1336 SVD 1338, 1356, 1358 real upper quasi-triangular matrix orthogonal similarity transformation 1213 reciprocal condition numbers for eigenvalues and/or eigenvectors ?tgsna 906 rectangular full packed format 368, 395 RQ factorization ?gerq2 1175 ?gerqf 710 ?orgrq 712 ?ormrq 716 ?ungrq 714 ?unmrq 718 RZ factorization ?ormrz 723 ?tzrzf 720 ?unmrz 725 singular value decomposition ?bdsdc 756 ?bdsqr 752 ?gbbrd 739
?gebrd 736 ?orgbr 742 ?ormbr 744 ?ungbr 747 ?unmbr 749 solution refinement and error estimation ?gbrfs 458 Intel® Math Kernel Library Reference Manual 2730 ?gbrfsx 461 ?gerfs 449 ?gerfsx 452 ?gtrfs 467 ?herfs 494 ?herfsx 496 ?hprfs 504 ?la_gbrfsx_extended 1462 ?la_gerfsx_extended 1473 ?la_herfsx_extended 1482 ?la_porfsx_extended 1493 ?la_syrfsx_extended 1511 ?pbrfs 480 ?porfs 469 ?porfsx 472 ?pprfs 478 ?ptrfs 483 ?sprfs 501 ?syrfs 485 ?syrfsx 488 ?tbrfs 511 ?tprfs 508 ?trrfs 506 solving linear equations ?dttrsb 392 ?gbtrs 387 ?getrs 385 ?gttrs 389 ?heswapr 1413 ?hetrs 404 ?hetrs2 408 ?hptrs 411 ?laln2 1232 ?laqtr 1289 ?pbtrs 398 ?pftrs 395 ?potrs 393 ?pptrs 396 ?pttrs 400 ?sptrs 409 ?syswapr 1411 ?syswapr1 1414 ?sytrs 402 ?sytrs2 406 ?tbtrs 418 ?tptrs 416 ?trtrs 413 sorting numbers 1371 square root 1262, 1263 square roots 1342, 1344, 1346, 1353, 1354, 1524 Sylvester equation ?lasy2 1375 ?tgsy2 1423 ?trsyl 874 symmetric band matrix equilibration 1285, 1287 symmetric band matrix in packed storage equilibration 1286 symmetric eigenvalue problems ?disna 818 ?hbtrd 791 ?herdb 766 ?hetrd 772 ?hptrd 784 ?opgtr 781 ?opmtr 782 ?orgtr 768 ?ormtr 770 ?pteqr 810 ?sbtrd 789 ?sptrd 779 ?stebz 813 ?stedc 801 ?stegr 805 ?stein 815 ?stemr 798 ?steqr 795 ?sterf 793 ?syrdb 764 ?sytrd 762 ?ungtr 775 ?unmtr 776 ?upgtr 786 ?upmtr 787 auxiliary ?lae2 1188 ?laebz 1189 ?laed0 1192 ?laed1 1194 ?laed2 1195 ?laed3 1197 ?laed4 1199 ?laed5 1200 ?laed6 1200 ?laed7 1202 ?laed8 1204 ?laed9 1207 ?laeda 1208 symmetric indefinite matrix equilibration ?syequb 554 symmetric matrix computing eigenvalues and eigenvectors 1212 packed storage 1249 symmetric positive-definite matrix equilibration ?poequ 547 ?poequb 549 symmetric positive-definite tridiagonal matrix eigenvalues 1363 trapezoidal matrix 1257, 1390 triangular factorization ?gbtrf 359 ?getrf 357 ?gttrf 361 ?hetrf 378 ?hptrf 383 ?pbtrf 371 ?potrf 364 ?pptrf 369 ?pstrf 366 ?pttrf 373 ?sptrf 381 ?sytrf 374 p?dbtrf 1542 triangular matrix packed 
storage 1256 triangular matrix factorization ?pftrf 368 ?pftri 517 ?tftri 535 triangular system of equations 1383, 1387 tridiagonal band matrix 1255 uniform distribution 1328 unreduced symmetric tridiagonal matrix 1192 updated upper bidiagonal matrix Index 2731 SVD 1347 updating sum of squares 1372 upper Hessenberg matrix computing a specified eigenvector 1209 eigenvalues 1224 Schur factorization 1224 utility functions and routines ?labad 1524 ?lamc1 1526 ?lamc2 1526 ?lamc3 1527 ?lamc4 1528 ?lamc5 1528 ?lamch 1525 chla_transtype 1529 ieeeck 1523 iladiag 1530 ilaenv 1520 ilaprec 1531 ilatrans 1531 ilauplo 1532 ilaver 1519 iparmq 1522 lsamen 1524 second/dsecnd 1529 xerbla_array 1532 Laplace 2168 Laplace problem three-dimensional 2461 two-dimensional 2459 largest absolute value of element complex Hermitian matrix packed storage 1250 complex Hermitian matrix in RFP format 1443 complex Hermitian tridiagonal matrix 1251 complex symmetric matrix 1252 general rectangular matrix 1244, 1779 general tridiagonal matrix 1245 Hermitian band matrix 1248 real symmetric matrix 1252, 1782 real symmetric matrix in RFP format 1442 real symmetric tridiagonal matrix 1251 symmetric band matrix 1247 symmetric matrix packed storage 1249 trapezoidal matrix 1257 triangular band matrix 1255 triangular matrix packed storage 1256 upper Hessenberg matrix 1246, 1780 leading dimension 2648 leapfrog method 2121 LeapfrogStream 2146 least squares problems length. 
dimension 2645 library version 2521 Library Version Obtaining 2521 library version string 2523 linear combination of distributed vectors 2378 linear combination of vectors 55, 327 Linear Congruential Generator 2118 linear equations, solving tridiagonal symmetric positive-definite matrix LAPACK 623 ScaLAPACK 1699 band matrix LAPACK 574, 576 ScaLAPACK 1685 banded matrix extra precise iterative refinement LAPACK 582 extra precise iterative refinement 461, 1462, 1493 LAPACK 582 Cholesky-factored matrix LAPACK 398 ScaLAPACK 1558 diagonally dominant tridiagonal matrix LAPACK 392, 595 diagonally dominant-like matrix banded 1553 tridiagonal 1555 general band matrix ScaLAPACK 1687 general matrix band storage 387, 1551 extra precise iterative refinement 452 extra precise iterative refinement 1473 general tridiagonal matrix ScaLAPACK 1689 Hermitian indefinite matrix extra precise iterative refinement LAPACK 649 extra precise iterative refinement 1482 LAPACK 649 Hermitian matrix error bounds 645, 663 packed storage 411, 661, 663 Hermitian positive-definite matrix band storage LAPACK 617 ScaLAPACK 1697 error bounds LAPACK 599 ScaLAPACK 1693 extra precise iterative refinement LAPACK 604 LAPACK linear equations, solving multiple right-sides symmetric packed storage 396, 611, 612 ScaLAPACK 1693 Hermitian positive-definite tridiagonal linear equations 1876 Hermitian positive-definite tridiagonal matrix 1560 multiple right-hand sides band matrix LAPACK 574, 576 ScaLAPACK 1685 banded matrix LAPACK 582 diagonally dominant tridiagonal matrix 595 Hermitian indefinite matrix LAPACK 649 Hermitian matrix 642, 661 Hermitian positive-definite matrix band storage 617 square matrix LAPACK 558, 561, 567 ScaLAPACK 1679, 1681 symmetric indefinite matrix LAPACK 635 symmetric matrix 629, 655 symmetric positive-definite matrix band storage 617 symmetric/Hermitian positive-definite matrix LAPACK 604 tridiagonal matrix 589, 591 overestimated or
underestimated system 1701 square matrix error bounds LAPACK 561, 576 ScaLAPACK 1681 extra precise iterative refinement LAPACK 567 LAPACK 558, 561, 567 ScaLAPACK 1679, 1681 symmetric indefinite matrix extra precise iterative refinement LAPACK 635 extra precise iterative refinement 1511 LAPACK 635 symmetric matrix error bounds 631, 657 packed storage 409, 655, 657 symmetric positive-definite matrix band storage LAPACK 617 ScaLAPACK 1697 error bounds LAPACK 599 ScaLAPACK 1693 extra precise iterative refinement LAPACK 472, 604 LAPACK 596, 599, 604 packed storage 396, 611, 612 ScaLAPACK 1691, 1693 symmetric positive-definite tridiagonal linear equations 1876 triangular matrix band storage 418, 1851 packed storage 416 tridiagonal Hermitian positive-definite matrix error bounds 625 LAPACK 623 ScaLAPACK 1699 tridiagonal matrix error bounds 591 LAPACK 389, 400, 589, 591 LAPACK auxiliary 1290 ScaLAPACK auxiliary 1875 tridiagonal symmetric positive-definite matrix error bounds 625 Linear Least Squares (LLS) Problems 930 LoadStreamF 2141 LoadStreamM 2144 Lognormal 2178 LQ factorization computing the elements of orthogonal matrix Q 692 real orthogonal matrix Q 1600 unitary matrix Q 696, 1602 general rectangular matrix 1170, 1756 lsame 2530 lsamen 1524, 2531 LU factorization band matrix blocked algorithm 1873 unblocked algorithm 1872 diagonally dominant tridiagonal matrix 363 diagonally dominant-like tridiagonal matrix 1543 general band matrix 1166 general matrix 1178, 1763 solving linear equations general matrix 1176 square matrix 1681 tridiagonal matrix 1179, 1221 triangular band matrix 1746 tridiagonal band matrix 1748 tridiagonal matrix 361, 1218, 1874 with complete pivoting 1177, 1382 with partial pivoting 1178, 1763 M machine parameters LAPACK 1525 ScaLAPACK 1881 matrix arguments column-major ordering 2645, 2648 example 2649 leading dimension 2648 number of columns 2648 number of rows 2648 transposition parameter 2648 matrix block QR factorization with pivoting 1268
matrix converters mkl_?csrbsr 304 mkl_?csrcoo 301 mkl_?csrcsc 307 mkl_?csrdia 309 mkl_?csrsky 313 mkl_?dnscsr 298 matrix equation AX = B 138, 355, 385, 1440, 1536, 1550 matrix one-dimensional substructures 2645 matrix-matrix operation product general distributed matrix 2418 general matrix 119, 333 rank-2k update Hermitian distributed matrix 2424 Hermitian matrix 126 symmetric distributed matrix 2430 symmetric matrix 133 rank-k update Hermitian matrix 124 symmetric distributed matrix 2428 rank-n update symmetric matrix 131 scalar-matrix-matrix product Hermitian distributed matrix 2420 Hermitian matrix 122 symmetric distributed matrix 2426 symmetric matrix 128 matrix-matrix operation:scalar-matrix-matrix product triangular distributed matrix 2435 triangular matrix 135 matrix-vector operation product Hermitian matrix 84, 86, 91 real symmetric matrix 98, 102 triangular matrix 107, 112, 115 rank-1 update Hermitian matrix 87, 92 real symmetric matrix 99, 104 rank-2 update Hermitian matrix 89, 94 symmetric matrix 101, 106 matrix-vector operation:product Hermitian matrix band storage 84 packed storage 91 Index 2733 real symmetric matrix packed storage 98 symmetric matrix band storage 95 triangular matrix band storage 107 packed storage 112 matrix-vector operation:rank-1 update Hermitian matrix packed storage 92 real symmetric matrix packed storage 99 matrix-vector operation:rank-2 update Hermitian matrix packed storage 94 symmetric matrix packed storage 101 mkl_?bsrgemv 164 mkl_?bsrmm 246 mkl_?bsrmv 218 mkl_?bsrsm 268 mkl_?bsrsv 232 mkl_?bsrsymv 173 mkl_?bsrtrsv 184 mkl_?coogemv 166 mkl_?coomm 254 mkl_?coomv 225 mkl_?coosm 265 mkl_?coosv 239 mkl_?coosymv 176 mkl_?cootrsv 186 mkl_?cscmm 250 mkl_?cscmv 222 mkl_?cscsm 261 mkl_?cscsv 235 mkl_?csradd 316 mkl_?csrbsr 304 mkl_?csrcoo 301 mkl_?csrcsc 307 mkl_?csrdia 309 mkl_?csrgemv 161 mkl_?csrmm 242 mkl_?csrmultcsr 320 mkl_?csrmultd 324 mkl_?csrmv 215 mkl_?csrsky 313 mkl_?csrsm 257 mkl_?csrsv 228 mkl_?csrsymv 171 mkl_?csrtrsv 
181 mkl_?diagemv 169 mkl_?diamm 284 mkl_?diamv 272 mkl_?diasm 291 mkl_?diasv 278 mkl_?diasymv 178 mkl_?diatrsv 189 mkl_?dnscsr 298 mkl_?imatcopy 335 mkl_?omatadd 344 mkl_?omatcopy 338 mkl_?omatcopy2 341 mkl_?skymm 288 mkl_?skymv 275 mkl_?skysm 295 mkl_?skysv 281 mkl_cspblas_?bsrgemv 194 mkl_cspblas_?bsrsymv 202 mkl_cspblas_?bsrtrsv 209 mkl_cspblas_?coogemv 197 mkl_cspblas_?coosymv 204 mkl_cspblas_?csrgemv 192 mkl_cspblas_?csrsymv 199 mkl_cspblas_?csrtrsv 207 mkl_cspblas_?dcootrsv 212 mkl_disable_fast_mm 2538 MKL_Disable_Fast_MM 2538 mkl_domain_get_max_threads 2527 MKL_Domain_Get_Max_Threads 2527 mkl_domain_set_num_threads 2525 MKL_Domain_Set_Num_Threads 2525 mkl_enable_instructions 2544 MKL_Enable_Instructions 2544 mkl_free usage example 2540 MKL_free 2540 mkl_free_buffers 2536 MKL_Free_Buffers 2536 MKL_FreeBuffers 2536 mkl_get_clocks_frequency 2535 MKL_Get_Clocks_Frequency 2535 mkl_get_cpu_clocks 2533 MKL_Get_Cpu_Clocks 2533 mkl_get_cpu_frequency 2534 MKL_Get_Cpu_Frequency 2534 mkl_get_dynamic 2528 MKL_Get_Dynamic 2528 mkl_get_max_cpu_frequency 2534 MKL_Get_Max_Cpu_Frequency 2534 mkl_get_max_threads 2526 MKL_Get_Max_Threads 2526 mkl_get_version 2521 MKL_Get_Version 2521 mkl_get_version_string 2523 mkl_malloc usage example 2540 MKL_malloc 2539 mkl_mem_stat usage example 2540 MKL_Mem_Stat 2538 MKL_MemStat 2538 mkl_progress 2542 mkl_set_dynamic 2526 MKL_Set_Dynamic 2526 mkl_set_interface_layer 2545 mkl_set_num_threads 2524 MKL_Set_Num_Threads 2524 mkl_set_progress 2547 mkl_set_threading_layer 2546 mkl_set_xerbla 2546 mkl_thread_free_buffers 2537 MKL_Thread_Free_Buffers 2537 MKLGetVersion 2521 MKLGetVersionString 2523 MPI Multiplicative Congruential Generator 2118 N naming conventions BLAS 51 LAPACK 668, 1536 Nonlinear Optimization Solvers 2496 PBLAS 2374 Sparse BLAS Level 1 140 Sparse BLAS Level 2 151 Sparse BLAS Level 3 151 VML 1970 negative eigenvalues 1778 NegBinomial 2206 Intel® Math Kernel Library Reference Manual 2734 NewStream 2128 NewStreamEx 2129 NewTaskX1D 
2228 Nonsymmetric Eigenproblems 1019 O off-diagonal elements initialization 1817 LAPACK 1361 ScaLAPACK 1817 one-dimensional FFTs storage effects 2341–2343 orthogonal matrix CS decomposition LAPACK 920, 925, 1060 from LQ factorization LAPACK 1396 ScaLAPACK 1836 from QL factorization LAPACK 1394, 1399 ScaLAPACK 1833, 1840 from QR factorization LAPACK 1395 ScaLAPACK 1835 from RQ factorization LAPACK 1397 ScaLAPACK 1838 P p?agemv 2389 p?ahemv 2397 p?amax 2376 p?asum 2377 p?asymv 2404 p?atrmv 2410 p?axpy 2378 p?copy 2379 p?dbsv 1687 p?dbtrf 1542 p?dbtrs 1553 p?dbtrsv 1746 p?dot 2380 p?dotc 2381 p?dotu 2382 p?dtsv 1689 p?dttrf 1543 p?dttrs 1555 p?dttrsv 1748 p?gbsv 1685 p?gbtrf 1540 p?gbtrs 1551 p?geadd 2415 p?gebd2 1751 p?gebrd 1666 p?gecon 1564 p?geequ 1583 p?gehd2 1754 p?gehrd 1657 p?gelq2 1756 p?gelqf 1598 p?gels 1701 p?gemm 2418 p?gemv 2387 p?geql2 1758 p?geqlf 1608 p?geqpf 1589 p?geqr2 1760 p?geqrf 1587 p?ger 2391 p?gerc 2393 p?gerfs 1570 p?gerq2 1762 p?gerqf 1617 p?geru 2394 p?gesv 1679 p?gesvd 1723 p?gesvx 1681 p?getf2 1763 p?getrf 1538 p?getri 1578 p?getrs 1550 p?ggqrf 1633 p?ggrqf 1636 p?heev 1713 p?heevd 1715 p?heevx 1717 p?hegst 1677 p?hegvx 1732 p?hemm 2420 p?hemv 2396 p?her 2399 p?her2 2400 p?her2k 2424 p?herk 2422 p?hetrd 1646 p?labad 1879 p?labrd 1765 p?lachkieee 1880 p?lacon 1768 p?laconsb 1769 p?lacp2 1770 p?lacp3 1772 p?lacpy 1773 p?laevswp 1774 p?lahqr 1664 p?lahrd 1775 p?laiect 1778 p?lamch 1881 p?lange 1779 p?lanhs 1780 p?lantr 1783 p?lapiv 1785 p?laqge 1787 p?laqsy 1789 p?lared1d 1791 p?lared2d 1792 p?larf 1793 p?larfb 1795 p?larfc 1798 p?larfg 1800 p?larft 1802 p?larz 1804 p?larzb 1807 p?larzt 1813 p?lascl 1815 p?laset 1817 p?lasmsub 1818 p?lasnbt 1882 p?lassq 1819 p?laswp 1821 p?latra 1822 p?latrd 1823 p?latrz 1828 p?lauu2 1830 p?lauum 1831 p?lawil 1832 p?max1 1744 p?nrm2 2383 Index 2735 p?org2l/p?ung2l 1833 p?org2r/p?ung2r 1835 p?orgl2/p?ungl2 1836 p?orglq 1600 p?orgql 1609 p?orgqr 1591 p?orgr2/p?ungr2 1838 p?orgrq 1619 p?orm2l/p?unm2l 1840 
p?orm2r/p?unm2r 1843 p?ormbr 1669 p?ormhr 1659 p?orml2/p?unml2 1846 p?ormlq 1603 p?ormql 1612 p?ormqr 1594 p?ormr2/p?unmr2 1849 p?ormrq 1622 p?ormrz 1628 p?ormtr 1643 p?pbsv 1697 p?pbtrf 1546 p?pbtrs 1558 p?pbtrsv 1851 p?pocon 1566 p?poequ 1584 p?porfs 1573 p?posv 1691 p?posvx 1693 p?potf2 1857 p?potrf 1545 p?potri 1580 p?potrs 1557 p?ptsv 1699 p?pttrf 1548 p?pttrs 1560 p?pttrsv 1854 p?rscl 1858 p?scal 2384 p?stebz 1651 p?stein 1653 p?sum1 1745 p?swap 2385 p?syev 1704 p?syevd 1706 p?syevx 1708 p?sygs2/p?hegs2 1859 p?sygst 1676 p?sygvx 1726 p?symm 2426 p?symv 2402 p?syr 2406 p?syr2 2407 p?syr2k 2430 p?syrk 2428 p?sytd2/p?hetd2 1861 p?sytrd 1640 p?tradd 2416 p?tran 2432 p?tranc 2434 p?tranu 2433 p?trcon 1568 p?trmm 2435 p?trmv 2409 p?trrfs 1576 p?trsm 2437 p?trsv 2413 p?trti2 1864 p?trtri 1581 p?trtrs 1562 p?tzrzf 1626 p?unglq 1602 p?ungql 1611 p?ungqr 1592 p?ungrq 1620 p?unmbr 1672 p?unmhr 1662 p?unmlq 1605 p?unmql 1615 p?unmqr 1596 p?unmrq 1624 p?unmrz 1631 p?unmtr 1648 Packed formats 2347 packed storage scheme 2646 parallel direct solver (Pardiso) 1885 parallel direct sparse solver interface pardiso 1886 pardiso_64 1903 pardiso_getenv 1904 pardiso_setenv 1904 pardisoinit 1902 parameters for a Givens rotation 64 modified Givens transformation 67 pardiso 1886 PARDISO parameters 1905 pardiso_64 1903 pardiso_getenv 1904 pardiso_setenv 1904 PARDISO* solver 1885 pardisoinit 1902 Partial Differential Equations support Helmholtz problem on a sphere 2459 Poisson problem on a sphere 2460 three-dimensional Helmholtz problem 2461 three-dimensional Laplace problem 2461 three-dimensional Poisson problem 2461 two-dimensional Helmholtz problem 2458 two-dimensional Laplace problem 2459 two-dimensional Poisson problem 2458 PBLAS Level 1 functions p?amax 2376 p?asum 2377 p?dot 2380 p?dotc 2381 p?dotu 2382 p?nrm2 2383 PBLAS Level 1 routines p?amax 2375 p?asum 2375 p?axpy 2375, 2378 p?copy 2375, 2379 p?dot 2375 p?dotc 2375 p?dotu 2375 p?nrm2 2375 p?scal 2375, 2384 p?swap 2375, 2385 
PBLAS Level 2 routines ?agemv 2386 ?asymv 2386 ?gemv 2386 ?ger 2386 ?gerc 2386 ?geru 2386 ?hemv 2386 ?her 2386 ?her2 2386 ?symv 2386 Intel® Math Kernel Library Reference Manual 2736 ?syr 2386 ?syr2 2386 ?trmv 2386 ?trsv 2386 p?agemv 2389 p?ahemv 2397 p?asymv 2404 p?atrmv 2410 p?gemv 2387 p?ger 2391 p?gerc 2393 p?geru 2394 p?hemv 2396 p?her 2399 p?her2 2400 p?symv 2402 p?syr 2406 p?syr2 2407 p?trmv 2409 p?trsv 2413 PBLAS Level 3 routines p?geadd 2415 p?gemm 2414, 2418 p?hemm 2414, 2420 p?her2k 2414, 2424 p?herk 2414, 2422 p?symm 2414, 2426 p?syr2k 2414, 2430 p?syrk 2414, 2428 p?tradd 2416 p?tran 2432 p?tranc 2434 p?tranu 2433 p?trmm 2414, 2435 p?trsm 2414, 2437 PBLAS routines routine groups pcagemv 2389 pcahemv 2397 pcamax 2376 pcatrmv 2410 pcaxpy 2378 pccopy 2379 pcdotc 2381 pcdotu 2382 pcgeadd 2415 pcgecon 1564 pcgemm 2418 pcgemv 2387 pcgerc 2393 pcgeru 2394 pchemm 2420 pchemv 2396 pcher 2399 pcher2 2400 pcher2k 2424 pcherk 2422 pcnrm2 2383 pcscal 2384 pcsscal 2384 pcswap 2385 pcsymm 2426 pcsyr2k 2430 pcsyrk 2428 pctradd 2416 pctranu 2433 pctrmm 2435 pctrmv 2409 pctrsm 2437 pctrsv 2413 pdagemv 2389 pdamax 2376 pdasum 2377 pdasymv 2404 pdatrmv 2410 pdaxpy 2378 pdcopy 2379 pddot 2380 PDE support pdgeadd 2415 pdgecon 1564 pdgemm 2418 pdgemv 2387 pdger 2391 pdlaiectb 1778 pdlaiectl 1778 pdnrm2 2383 pdscal 2384 pdswap 2385 pdsymm 2426 pdsymv 2402 pdsyr 2406 pdsyr2 2407 pdsyr2k 2430 pdsyrk 2428 pdtradd 2416 pdtran 2432 pdtranc 2434 pdtrmm 2435 pdtrmv 2409 pdtrsm 2437 pdtrsv 2413 pdzasum 2377 permutation matrix 2630 picopy 2379 pivoting matrix rows or columns 1785 PL Interface 2457 points rotation in the modified plane 65 in the plane 63 Poisson 2202 Poisson Library routines ?_commit_Helmholtz_2D 2467 ?_commit_Helmholtz_3D 2467 ?_commit_sph_np 2476 ?_commit_sph_p 2476 ?_Helmholtz_2D 2470 ?_Helmholtz_3D 2470 ?_init_Helmholtz_2D 2465 ?_init_Helmholtz_3D 2465 ?_init_sph_np 2475 ?_init_sph_p 2475 ?_sph_np 2478 ?_sph_p 2478 free_Helmholtz_2D 2474 free_Helmholtz_3D 2474 
free_sph_np 2480 free_sph_p 2480 structure 2457 Poisson problem on a sphere 2460 three-dimensional 2461 two-dimensional 2458 PoissonV 2204 pprfs 478 pptrs 396 preconditioned Jacobi SVD 1045 preconditioners based on incomplete LU factorization dcsrilu0 1961 Index 2737 dcsrilut 1963 Preconditioners Interface Description 1960 process grid 1535, 2373 product distributed matrix-vector general matrix 2387, 2389 distributed vector-scalar 2384 matrix-vector distributed Hermitian matrix 2396, 2397 distributed symmetric matrix 2402, 2404 distributed triangular matrix 2409, 2410 general matrix 75, 77, 329, 331, 1468 Hermitian indefinite matrix 1478 Hermitian matrix 84, 86, 91 real symmetric matrix 98, 102 symmetric indefinite matrix 1505 triangular matrix 107, 112, 115 scalar-matrix general distributed matrix 2418 general matrix 119, 333 Hermitian distributed matrix 2420 Hermitian matrix 122 scalar-matrix-matrix general distributed matrix 2418 general matrix 119, 333 Hermitian distributed matrix 2420 Hermitian matrix 122 symmetric distributed matrix 2426 symmetric matrix 128 triangular distributed matrix 2435 triangular matrix 135 vector-scalar 69 product:matrix-vector general matrix band storage 75 Hermitian matrix band storage 84 packed storage 91 real symmetric matrix packed storage 98 symmetric matrix band storage 95 triangular matrix band storage 107 packed storage 112 psagemv 2389 psamax 2376 psasum 2377 psasymv 2404 psatrmv 2410 psaxpy 2378 pscasum 2377 pscopy 2379 psdot 2380 pseudorandom numbers psgeadd 2415 psgecon 1564 psgemm 2418 psgemv 2387 psger 2391 pslaiect 1778 psnrm2 2383 psscal 2384 psswap 2385 pssymm 2426 pssymv 2402 pssyr 2406 pssyr2 2407 pssyr2k 2430 pssyrk 2428 pstradd 2416 pstran 2432 pstranc 2434 pstrmm 2435 pstrmv 2409 pstrsm 2437 pstrsv 2413 pxerbla 1882, 2530 pzagemv 2389 pzahemv 2397 pzamax 2376 pzatrmv 2410 pzaxpy 2378 pzcopy 2379 pzdotc 2381 pzdotu 2382 pzdscal 2384 pzgeadd 2415 pzgecon 1564 pzgemm 2418 pzgemv 2387 pzgerc 2393 pzgeru 2394 pzhemm 
2420 pzhemv 2396 pzher 2399 pzher2 2400 pzher2k 2424 pzherk 2422 pznrm2 2383 pzscal 2384 pzswap 2385 pzsymm 2426 pzsyr2k 2430 pzsyrk 2428 pztradd 2416 pztranu 2433 pztrmm 2435 pztrmv 2409 pztrsm 2437 pztrsv 2413 Q QL factorization computing the elements of complex matrix Q 704 orthogonal matrix Q 1609 real matrix Q 702 unitary matrix Q 1611 general rectangular matrix LAPACK 1171 ScaLAPACK 1758 multiplying general matrix by orthogonal matrix Q 1612 unitary matrix Q 1615 QR factorization computing the elements of orthogonal matrix Q 681, 1591 unitary matrix Q 685, 1592 general rectangular matrix LAPACK 1172, 1174, 1175 ScaLAPACK 1760, 1762 with pivoting ScaLAPACK 1589 quasi-random numbers quasi-triangular matrix LAPACK 833, 877 ScaLAPACK 1656 quasi-triangular system of equations 1289 Intel® Math Kernel Library Reference Manual 2738 R random number generators 2115 random stream 2123 random stream descriptor 2117 Random Streams 2123 rank-1 update conjugated, distributed general matrix 2393 conjugated, general matrix 81 distributed general matrix 2391 distributed Hermitian matrix 2399 distributed symmetric matrix 2406 general matrix 79 Hermitian matrix packed storage 92 real symmetric matrix packed storage 99 unconjugated, distributed general matrix 2394 unconjugated, general matrix 82 rank-2 update distributed Hermitian matrix 2400 distributed symmetric matrix 2407 Hermitian matrix packed storage 94 symmetric matrix packed storage 101 rank-2k update Hermitian distributed matrix 2424 Hermitian matrix 126 symmetric distributed matrix 2430 symmetric matrix 133 rank-k update distributed Hermitian matrix 2422 Hermitian matrix 124 symmetric distributed matrix 2428 rank-n update symmetric matrix 131 Rayleigh 2175 RCI CG Interface 1933 RCI CG sparse solver routines dcg 1946, 1950 dcg_check 1946 dcg_get 1948 dcg_init 1945 dcgmrhs_check 1949 dcgmrhs_get 1952 dcgmrhs_init 1948 RCI FGMRES Interface 1938 RCI FGMRES sparse solver routines dfgmres_check 1953 dfgmres_get 1956 
dfgmres_init 1952 RCI FGMRES sparse solver routines dfgmres 1954 RCI ISS 1932 RCI ISS interface 1932 RCI ISS sparse solver routines implementation details 1957 real matrix QR factorization with pivoting 1269 real symmetric matrix 1-norm value 1252 Frobenius norm 1252 infinity-norm 1252 largest absolute value of element 1252 real symmetric tridiagonal matrix 1-norm value 1251 Frobenius norm 1251 infinity-norm 1251 largest absolute value of element 1251 reducing generalized eigenvalue problems LAPACK 820 ScaLAPACK 1676 reduction to upper Hessenberg form general matrix 1754 general square matrix 1168 refining solutions of linear equations band matrix 458 banded matrix 461, 1462, 1493 general matrix 449, 452, 1473, 1570 Hermitian indefinite matrix 496, 1482 Hermitian matrix packed storage 504 Hermitian positive-definite matrix band storage 480 packed storage 478 symmetric indefinite matrix 488, 1511 symmetric matrix packed storage 501 symmetric positive-definite matrix band storage 480 packed storage 478 symmetric/Hermitian positive-definite distributed matrix 1573 tridiagonal matrix 467 RegisterBrng 2209 registering a basic generator 2208 reordering of matrices 2631 Reverse Communication Interface 1932 rotation of points in the modified plane 65 of points in the plane 63 of sparse vectors 148 parameters for a Givens rotation 64 parameters of modified Givens transformation 67 routine name conventions BLAS 51 Nonlinear Optimization Solvers 2496 PBLAS 2374 Sparse BLAS Level 1 140 Sparse BLAS Level 2 151 Sparse BLAS Level 3 151 RQ factorization computing the elements of complex matrix Q 714 orthogonal matrix Q 1619 real matrix Q 712 unitary matrix Q 1620 S SaveStreamF 2140 SaveStreamM 2142 sbbcsd 920 sbdsdc 756 ScaLAPACK ScaLAPACK routines 1D array redistribution 1791, 1792 auxiliary routines ?combamax1 1745 ?dbtf2 1872 ?dbtrf 1873 ?dttrf 1874 ?dttrsv 1875 ?lamsh 1866 ?laref 1867 ?lasorte 1868 ?lasrt2 1869 ?pttrsv 1876 ?stein2 1870 ?steqr2 1878 p?dbtrsv 1746 
p?dttrsv 1748 p?gebd2 1751 p?gehd2 1754 p?gelq2 1756 p?geql2 1758 p?geqr2 1760 p?gerq2 1762 p?getf2 1763 p?labrd 1765 p?lacgv 1743 p?lacon 1768 p?laconsb 1769 p?lacp2 1770 p?lacp3 1772 p?lacpy 1773 p?laevswp 1774 p?lahrd 1775 p?laiect 1778 p?lange 1779 p?lanhs 1780 p?lansy, p?lanhe 1782 p?lantr 1783 p?lapiv 1785 p?laqge 1787 p?laqsy 1789 p?lared1d 1791 p?lared2d 1792 p?larf 1793 p?larfb 1795 p?larfc 1798 p?larfg 1800 p?larft 1802 p?larz 1804 p?larzb 1807 p?larzc 1809 p?larzt 1813 p?lascl 1815 p?laset 1817 p?lasmsub 1818 p?lassq 1819 p?laswp 1821 p?latra 1822 p?latrd 1823 p?latrs 1826 p?latrz 1828 p?lauu2 1830 p?lauum 1831 p?lawil 1832 p?max1 1744 p?org2l/p?ung2l 1833 p?org2r/p?ung2r 1835 p?orgl2/p?ungl2 1836 p?orgr2/p?ungr2 1838 p?orm2l/p?unm2l 1840 p?orm2r/p?unm2r 1843 p?orml2/p?unml2 1846 p?ormr2/p?unmr2 1849 p?pbtrsv 1851 p?potf2 1857 p?pttrsv 1854 p?rscl 1858 p?sum1 1745 p?sygs2/p?hegs2 1859 p?sytd2/p?hetd2 1861 p?trti2 1864 pdlaiectb 1778 pdlaiectl 1778 pslaiect 1778 block reflector triangular factor 1802, 1813 Cholesky factorization 1548 complex matrix complex elementary reflector 1809 complex vector 1-norm using true absolute value 1745 complex vector conjugation 1743 condition number estimation p?gecon 1564 p?pocon 1566 p?trcon 1568 driver routines p?dbsv 1687 p?dtsv 1689 p?gbsv 1685 p?gels 1701 p?gesv 1679 p?gesvd 1723 p?gesvx 1681 p?heev 1713 p?heevd 1715 p?heevx 1717 p?hegvx 1732 p?pbsv 1697 p?posv 1691 p?posvx 1693 p?ptsv 1699 p?syev 1704 p?syevd 1706 p?syevx 1708 p?sygvx 1726 error estimation p?trrfs 1576 error handling pxerbla 1882, 2530 general matrix block reflector 1807 elementary reflector 1804 LU factorization 1763 reduction to upper Hessenberg form 1754 general rectangular matrix elementary reflector 1793 LQ factorization 1756 QL factorization 1758 QR factorization 1760 reduction to bidiagonal form 1765 reduction to real bidiagonal form 1751 row interchanges 1821 RQ factorization 1762 generalized eigenvalue problems p?hegst 1677 p?sygst 1676 
Householder matrix elementary reflector 1800 LQ factorization p?gelq2 1756 p?gelqf 1598 p?orglq 1600 p?ormlq 1603 p?unglq 1602 p?unmlq 1605 LU factorization p?dbtrsv 1746 p?dttrf 1543 p?dttrsv 1748 Intel® Math Kernel Library Reference Manual 2740 p?getf2 1763 matrix equilibration p?geequ 1583 p?poequ 1584 matrix inversion p?getri 1578 p?potri 1580 p?trtri 1581 nonsymmetric eigenvalue problems p?gehrd 1657 p?lahqr 1664 p?ormhr 1659 p?unmhr 1662 QL factorization ?geqlf 1608 ?ungql 1611 p?geql2 1758 p?orgql 1609 p?ormql 1612 p?unmql 1615 QR factorization p?geqpf 1589 p?geqr2 1760 p?ggqrf 1633 p?orgqr 1591 p?ormqr 1594 p?ungqr 1592 p?unmqr 1596 RQ factorization p?gerq2 1762 p?gerqf 1617 p?ggrqf 1636 p?orgrq 1619 p?ormrq 1622 p?ungrq 1620 p?unmrq 1624 RZ factorization p?ormrz 1628 p?tzrzf 1626 p?unmrz 1631 singular value decomposition p?gebrd 1666 p?ormbr 1669 p?unmbr 1672 solution refinement and error estimation p?gerfs 1570 p?porfs 1573 solving linear equations ?dttrsv 1875 ?pttrsv 1876 p?dbtrs 1553 p?dttrs 1555 p?gbtrs 1551 p?getrs 1550 p?potrs 1557 p?pttrs 1560 p?trtrs 1562 symmetric eigenproblems p?hetrd 1646 p?ormtr 1643 p?stebz 1651 p?stein 1653 p?sytrd 1640 p?unmtr 1648 symmetric eigenvalue problems ?stein2 1870 ?steqr2 1878 trapezoidal matrix 1828 triangular factorization ?dbtrf 1873 ?dttrf 1874 p?dbtrsv 1746 p?dttrsv 1748 p?gbtrf 1540 p?getrf 1538 p?pbtrf 1546 p?potrf 1545 p?pttrf 1548 triangular system of equations 1826 updating sum of squares 1819 utility functions and routines p?labad 1879 p?lachkieee 1880 p?lamch 1881 p?lasnbt 1882 pxerbla 1882, 2530 scalar-matrix product 119, 122, 128, 333, 2418, 2420, 2426 scalar-matrix-matrix product general distributed matrix 2418 general matrix 119, 333 symmetric distributed matrix 2426 symmetric matrix 128 triangular distributed matrix 2435 triangular matrix 135 scaling general rectangular matrix 1787 symmetric/Hermitian matrix 1789 scaling factors general rectangular distributed matrix 1583 Hermitian positive 
definite distributed matrix 1584 symmetric positive definite distributed matrix 1584 scattering compressed sparse vector's elements into full storage form 149 Schur decomposition 894, 896 Schur factorization 1223, 1224, 1259 scsum1 1165 second/dsecnd 2532 Service Functions 1972 Service Routines 2127 SetInternalDecimation 2237 sgbcon 422 sgbrfsx 461 sgbsvx 576 sgbtrs 387 sgecon 420 sgejsv 1045 sgeqpf 676 sgesvj 1051 sgtrfs 467 shgeqz 885 shseqr 851 simple driver 1536 Single Dynamic Library mkl_set_interface_layer 2545 mkl_set_progress 2547 mkl_set_threading_layer 2546 mkl_set_xerbla 2546 single node matrix 1866 singular value decomposition LAPACK 734 LAPACK routines, singular value decomposition 1666 ScaLAPACK 1666, 1723 See also LAPACK routines, singular value decomposition 734 Singular Value Decomposition 1037 sjacobi 2515 sjacobi_delete 2514 sjacobi_init 2512 sjacobi_solve 2513 Index 2741 sjacobix 2516 SkipAheadStream 2148 sla_gbamv 1455 sla_gbrcond 1457 sla_gbrfsx_extended 1462 sla_gbrpvgrw 1467 sla_geamv 1468 sla_gercond 1470 sla_gerfsx_extended 1473 sla_lin_berr 1488 sla_porcond 1489 sla_porfsx_extended 1493 sla_porpvgrw 1498 sla_rpvgrw 1503 sla_syamv 1505 sla_syrcond 1507 sla_syrfsx_extended 1511 sla_syrpvgrw 1516 sla_wwaddw 1517 slag2d 1428 slapmr 1260 slapmt 1262 slarfb 1295 slarft 1300 slarscl2 1504 slartgp 1324 slartgs 1326 slascl2 1504 slatps 1383 slatrd 1385 slatrs 1387 slatrz 1390 slauu2 1392 slauum 1393 small subdiagonal element 1818 smallest absolute value of a vector element 72 sNewAbstractStream 2135 solver direct 2629 iterative 2629 Solver Sparse 1885 solving linear equations 387 solving linear equations. linear equations 1551 solving linear equations. 
See linear equations 1232 sorbdb 925 sorcsd 1060 sorg2l 1394 sorg2r 1395 sorgl2 1396 sorgr2 1397 sorm2l 1399 sorm2r 1400 sorml2 1402 sormr2 1404 sormr3 1405 sorting eigenpairs 1868 numbers in increasing/decreasing order LAPACK 1371 ScaLAPACK 1869 Sparse BLAS Level 1 data types 140 naming conventions 140 Sparse BLAS Level 1 routines and functions ?axpyi 141 ?dotci 144 ?doti 143 ?dotui 145 ?gthr 146 ?gthrz 147 ?roti 148 ?sctr 149 Sparse BLAS Level 2 naming conventions 151 sparse BLAS Level 2 routines mkl_?bsrgemv 164 mkl_?bsrmv 218 mkl_?bsrsv 232 mkl_?bsrsymv 173 mkl_?bsrtrsv 184 mkl_?coogemv 166 mkl_?coomv 225 mkl_?coosv 239 mkl_?coosymv 176 mkl_?cootrsv 186 mkl_?cscmv 222 mkl_?cscsv 235 mkl_?csrgemv 161 mkl_?csrmv 215 mkl_?csrsv 228 mkl_?csrsymv 171 mkl_?csrtrsv 181 mkl_?diagemv 169 mkl_?diamv 272 mkl_?diasv 278 mkl_?diasymv 178 mkl_?diatrsv 189 mkl_?skymv 275 mkl_?skysv 281 mkl_cspblas_?bsrgemv 194 mkl_cspblas_?bsrsymv 202 mkl_cspblas_?bsrtrsv 209 mkl_cspblas_?coogemv 197 mkl_cspblas_?coosymv 204 mkl_cspblas_?cootrsv 212 mkl_cspblas_?csrgemv 192 mkl_cspblas_?csrsymv 199 mkl_cspblas_?csrtrsv 207 Sparse BLAS Level 3 naming conventions 151 sparse BLAS Level 3 routines mkl_?bsrmm 246 mkl_?bsrsm 268 mkl_?coomm 254 mkl_?coosm 265 mkl_?cscmm 250 mkl_?cscsm 261 mkl_?csradd 316 mkl_?csrmm 242 mkl_?csrmultcsr 320 mkl_?csrmultd 324 mkl_?csrsm 257 mkl_?diamm 284 mkl_?diasm 291 mkl_?skymm 288 mkl_?skysm 295 sparse BLAS routines mkl_?csrbsr 304 mkl_?csrcoo 301 mkl_?csrcsc 307 mkl_?csrdia 309 mkl_?csrsky 313 mkl_?dnscsr 298 sparse matrices 151 sparse matrix 151 Sparse Matrix Storage Formats 152 sparse solver parallel direct sparse solver interface pardiso 1886 Intel® Math Kernel Library Reference Manual 2742 pardiso_64 1903 pardiso_getenv 1904 pardiso_setenv 1904 pardisoinit 1902 Sparse Solver direct sparse solver interface dss_create 1916 dss_define_structure dss_define_structure 1918 dss_delete 1926 dss_factor 1921 dss_factor_complex 1921 dss_factor_real 1921 dss_reorder 1920 
dss_solve 1923 dss_solve_complex 1923 dss_solve_real 1923 dss_statistics 1927 mkl_cvt_to_null_terminated_str 1930 iterative sparse solver interface dcg 1946 dcg_check 1946 dcg_get 1948 dcg_init 1945 dcgmrhs 1950 dcgmrhs_check 1949 dcgmrhs_get 1952 dcgmrhs_init 1948 dfgmres 1954 dfgmres_check 1953 dfgmres_get 1956 dfgmres_init 1952 preconditioners based on incomplete LU factorization dcsrilu0 1961 dcsrilut 1963 Sparse Solvers 1905 sparse vectors adding and scaling 141 complex dot product, conjugated 144 complex dot product, unconjugated 145 compressed form 140 converting to compressed form 146, 147 converting to full-storage form 149 full-storage form 140 Givens rotation 148 norm 140 passed to BLAS level 1 routines 140 real dot product 143 scaling 140 spbtf2 1407 specific hardware support mkl_enable_instructions 2544 Spline Methods 2606 split Cholesky factorization (band matrices) 831 sporfsx 472 spotf2 1408 spprfs 478 spptrs 396 sptts2 1409 square matrix 1-norm estimation LAPACK 1184, 1185 ScaLAPACK 1768 srscl 1411 ssyconv 436 ssygs2 1415 ssyswapr 1411 ssyswapr1 1414 ssytd2 1417 ssytf2 1418 ssytri2 523 ssytri2x 527 ssytrs2 406 stgex2 1421 stgsy2 1423 stream 2123 strexc 868 stride. 
increment 2645 strnlsp_check 2499 strnlsp_delete 2503 strnlsp_get 2502 strnlsp_init 2497 strnlsp_solve 2500 strnlspbc_check 2506 strnlspbc_delete 2511 strnlspbc_get 2510 strnlspbc_init 2505 strnlspbc_solve 2508 strti2 1426 sum of distributed vectors 2378 of magnitudes of elements of a distributed vector 2377 of magnitudes of the vector elements 54 of sparse vector and full-storage vector 141 of vectors 55, 327 sum of squares updating LAPACK 1372 ScaLAPACK 1819 summary statistics vsldsscompute 2302 vsldSSCompute 2302 vsldsseditcorparameterization 2298 vsldSSEditCorParameterization 2298 vsldsseditcovcor 2280 vsldSSEditCovCor 2280 vsldsseditmissingvalues 2294 vsldSSEditMissingValues 2294 vsldsseditmoments 2278 vsldSSEditMoments 2278 vsldsseditoutliersdetection 2292 vsldSSEditOutliersDetection 2292 vsldsseditpartialcovcor 2282 vsldSSEditPartialCovCor 2282 vsldsseditpooledcovariance 2287 vsldSSEditPooledCovariance 2287 vsldsseditquantiles 2284 vsldSSEditQuantiles 2284 vsldsseditrobustcovariance 2289 vsldSSEditRobustCovariance 2289 vsldsseditstreamquantiles 2286 vsldSSEditStreamQuantiles 2286 vsldssedittask 2270 vsldSSEditTask 2270 vsldssnewtask 2267 vsldSSNewTask 2267 vslissedittask 2270 vsliSSEditTask 2270 vslssdeletetask 2303 vslSSDeleteTask 2303 vslssscompute 2302 vslsSSCompute 2302 vslssseditcorparameterization 2298 vslsSSEditCorParameterization 2298 vslssseditcovcor 2280 vslsSSEditCovCor 2280 vslssseditmissingvalues 2294 vslsSSEditMissingValues 2294 Index 2743 vslssseditmoments 2278 vslsSSEditMoments 2278 vslssseditoutliersdetection 2292 vslsSSEditOutliersDetection 2292 vslssseditpartialcovcor 2282 vslsSSEditPartialCovCor 2282 vslssseditpooledcovariance 2287 vslsSSEditPooledCovariance 2287 vslssseditquantiles 2284 vslsSSEditQuantiles 2284 vslssseditrobustcovariance 2289 vslsSSEditRobustCovariance 2289 vslssseditstreamquantiles 2286 vslsSSEditStreamQuantiles 2286 vslsssedittask 2270 vslsSSEditTask 2270 vslsssnewtask 2267 vslsSSNewTask 2267 summary statistics usage 
examples 2304 support functions mkl_free 2540 mkl_malloc 2539 mkl_mem_stat 2538 mkl_progress 2542 support routines mkl_disable_fast_mm 2538 mkl_free_buffers 2536 mkl_thread_free_buffers 2537 progress information 2542 SVD (singular value decomposition) LAPACK 734 ScaLAPACK 1666 swapping adjacent diagonal blocks 1213, 1421 swapping distributed vectors 2385 swapping vectors 70 Sylvester's equation 874 symmetric band matrix 1-norm value 1247 Frobenius norm 1247 infinity- norm 1247 largest absolute value of element 1247 symmetric distributed matrix rank-n update 2428, 2430 scalar-matrix-matrix product 2426 Symmetric Eigenproblems 948 symmetric indefinite matrix factorization with diagonal pivoting method 1418 matrix-vector product 1505 symmetric matrix Bunch-Kaufman factorization packed storage 381 eigenvalues and eigenvectors 1704, 1706, 1708 estimating the condition number packed storage 439 generalized eigenvalue problems 819 inverting the matrix packed storage 530 matrix-vector product band storage 95 packed storage 98, 1159 rank-1 update packed storage 99, 1161 rank-2 update packed storage 101 rank-2k update 133 rank-n update 131 reducing to standard form LAPACK 1415 ScaLAPACK 1859 reducing to tridiagonal form LAPACK 1385 ScaLAPACK 1823 scalar-matrix-matrix product 128 scaling 1789 solving systems of linear equations packed storage 409 symmetric matrix in packed form 1-norm value 1249 Frobenius norm 1249 infinity- norm 1249 largest absolute value of element 1249 symmetric positive definite distributed matrix computing scaling factors 1584 equilibration 1584 symmetric positive semidefinite matrix Cholesky factorization 366 symmetric positive-definite band matrix Cholesky factorization 1407 symmetric positive-definite distributed matrix inverting the matrix 1580 symmetric positive-definite matrix Cholesky factorization band storage 371, 1546 LAPACK 1408 packed storage 369 ScaLAPACK 1545, 1857 estimating the condition number band storage 430 packed storage 428 
tridiagonal matrix 432 inverting the matrix packed storage 519 solving systems of linear equations band storage 398, 1558 LAPACK 393 packed storage 396 ScaLAPACK 1557 symmetric positive-definite tridiagonal matrix solving systems of linear equations 1560 system of linear equations with a distributed triangular matrix 2413 with a triangular matrix band storage 109 packed storage 113 systems of linear equations linear equations 1875 systems of linear equationslinear equations 1550 syswapr 1411 syswapr1 1414 sytri2 523 sytri2x 527 T Task Computation Routines 2606 Task Creation and Initialization NewTask1d 2592 Task Status 2590 threading control mkl_domain_get_max_threads 2527 mkl_domain_set_num_threads 2525 mkl_get_dynamic 2528 mkl_get_max_threads 2526 mkl_set_dynamic 2526 mkl_set_num_threads 2524 Threading Control 2524 timing functions mkl_get_clocks_frequency 2535 MKL_Get_Cpu_Clocks 2533 Intel® Math Kernel Library Reference Manual 2744 mkl_get_cpu_frequency 2534 mkl_get_max_cpu_frequency 2534 second/dsecnd 2532 TR routines ?trnlsp_check 2499 ?trnlsp_delete 2503 ?trnlsp_get 2502 ?trnlsp_init 2497 ?trnlsp_solve 2500 ?trnlspbc_check 2506 ?trnlspbc_delete 2511 ?trnlspbc_get 2510 ?trnlspbc_init 2505 ?trnlspbc_solve 2508 nonlinear least squares problem with linear bound constraints 2504 without constraints 2496 organization and implementation 2495 transposition distributed complex matrix 2433 distributed complex matrix, conjugated 2434 distributed real matrix 2432 Transposition and General Memory Movement Routines 327 transposition parameter 2648 trapezoidal matrix 1-norm value 1257 Frobenius norm 1257 infinity- norm 1257 largest absolute value of element 1257 reduction to triangular form 1828 RZ factorization LAPACK 720 ScaLAPACK 1626 trexc 868 triangular band matrix 1-norm value 1255 Frobenius norm 1255 infinity- norm 1255 largest absolute value of element 1255 triangular banded equations LAPACK 1380 ScaLAPACK 1851 triangular distributed matrix inverting the matrix 1581 
scalar-matrix-matrix product 2435 triangular factorization band matrix 359, 1540, 1542, 1746, 1873 diagonally dominant tridiagonal matrix LAPACK 363 general matrix 357, 1538 Hermitian matrix packed storage 383 Hermitian positive semidefinite matrix 366 Hermitian positive-definite matrix band storage 371, 1546 packed storage 369 tridiagonal matrix 373, 1548 symmetric matrix packed storage 381 symmetric positive semidefinite matrix 366 symmetric positive-definite matrix band storage 371, 1546 packed storage 369 tridiagonal matrix 373, 1548 tridiagonal matrix LAPACK 361 ScaLAPACK 1874 triangular matrix 1-norm value LAPACK 1257 ScaLAPACK 1783 copying 1444–1446, 1448–1450 estimating the condition number band storage 447 packed storage 445 Frobenius norm LAPACK 1257 ScaLAPACK 1783 infinity- norm LAPACK 1257 ScaLAPACK 1783 inverting the matrix LAPACK 1426 packed storage 536 ScaLAPACK 1864 largest absolute value of element LAPACK 1257 ScaLAPACK 1783 matrix-vector product band storage 107 packed storage 112 product blocked algorithm 1393, 1831 LAPACK 1392, 1393 ScaLAPACK 1830, 1831 unblocked algorithm 1392 ScaLAPACK 1656 scalar-matrix-matrix product 135 solving systems of linear equations band storage 109, 418 packed storage 113, 416 ScaLAPACK 1562 swapping adjacent diagonal blocks 1421 triangular matrix factorization Hermitian positive-definite matrix 364 symmetric positive-definite matrix 364 triangular matrix in packed form 1-norm value 1256 Frobenius norm 1256 infinity- norm 1256 largest absolute value of element 1256 triangular system of equations solving with scale factor LAPACK 1387 ScaLAPACK 1826 tridiagonal matrix estimating the condition number 424 solving systems of linear equations ScaLAPACK 1875 tridiagonal system of equations 1409 tridiagonal triangular factorization band matrix 1748 tridiagonal triangular system of equations 1854 trigonometric transform backward cosine 2442 backward sine 2442 backward staggered cosine 2443 backward staggered sine 2442 
backward twice staggered cosine 2443 backward twice staggered sine 2442 forward cosine 2442 forward sine 2442 forward staggered cosine 2443 forward staggered sine 2442 forward twice staggered cosine 2443 forward twice staggered sine 2442 Trigonometric Transform interface routines ?_backward_trig_transform 2450 Index 2745 ?_commit_trig_transform 2446 ?_forward_trig_transform 2448 ?_init_trig_transform 2445 free_trig_transform 2451 Trigonometric Transforms interface 2445 TT interface 2441 TT routines 2445 two matrices QR factorization LAPACK 728 ScaLAPACK 1633 U ungbr 747 Uniform (continuous) 2156 Uniform (discrete) 2189 UniformBits 2191 UniformBits32 2192 UniformBits64 2193 unitary matrix CS decomposition LAPACK 920, 925, 1060 from LQ factorization LAPACK 1396 ScaLAPACK 1836 from QL factorization LAPACK 1394, 1399 ScaLAPACK 1833, 1840 from QR factorization LAPACK 1395 ScaLAPACK 1835 from RQ factorization LAPACK 1397 ScaLAPACK 1838 ScaLAPACK 1656, 1666 Unpack Functions 1972 updating rank-1 distributed general matrix 2391 distributed Hermitian matrix 2399 distributed symmetric matrix 2406 general matrix 79 Hermitian matrix 87, 92 real symmetric matrix 99, 104 rank-1, conjugated distributed general matrix 2393 general matrix 81 rank-1, unconjugated distributed general matrix 2394 general matrix 82 rank-2 distributed Hermitian matrix 2400 distributed symmetric matrix 2407 Hermitian matrix 89, 94 symmetric matrix 101, 106 rank-2k Hermitian distributed matrix 2424 Hermitian matrix 126 symmetric distributed matrix 2430 symmetric matrix 133 rank-k distributed Hermitian matrix 2422 Hermitian matrix 124 symmetric distributed matrix 2428 rank-n symmetric matrix 131 updating:rank-1 Hermitian matrix packed storage 92 real symmetric matrix packed storage 99 updating:rank-2 Hermitian matrix packed storage 94 symmetric matrix packed storage 101 upper Hessenberg matrix 1-norm value LAPACK 1246 ScaLAPACK 1780 Frobenius norm LAPACK 1246 ScaLAPACK 1780 infinity- norm LAPACK 1246 
ScaLAPACK 1780 largest absolute value of element LAPACK 1246 ScaLAPACK 1780 ScaLAPACK 1656 user time 1529 V v?Abs 1989 v?Acos 2042 v?Acosh 2061 v?Add 1976 v?Arg 1991 v?Asin 2045 v?Asinh 2064 v?Atan 2047 v?Atan2 2050 v?Atanh 2067 v?Cbrt 2004 v?CdfNorm 2075 v?CdfNormInv 2082 v?Ceil 2089 v?CIS 2038 v?Conj 1987 v?Cos 2031 v?Cosh 2052 v?Div 1997 v?Erf 2070 v?Erfc 2073 v?ErfcInv 2080 v?ErfInv 2077 v?Exp 2019 v?Expm1 2022 v?Floor 2088 v?Hypot 2017 v?Inv 1995 v?InvCbrt 2006 v?InvSqrt 2002 v?lgamma 2084 v?LGamma 2084 v?LinearFrac 1993 v?Ln 2024 v?Log10 2027 v?Log1p 2030 v?Modf 2098 v?Mul 1983 v?MulByConj 1986 v?NearbyInt 2094 v?Pack 2100 v?Pow 2011 v?Pow2o3 2007 v?Pow3o2 2009 Intel® Math Kernel Library Reference Manual 2746 v?Powx 2014 v?Rint 2096 v?Round 2093 v?Sin 2034 v?SinCos 2036 v?Sinh 2055 v?Sqr 1981 v?Sqrt 2000 v?Sub 1979 v?Tan 2040 v?Tanh 2058 v?tgamma 2086 v?TGamma 2086 v?Trunc 2091 v?Unpack 2103 vcAdd 1976 vcPackI 2100 vcPackM 2100 vcPackV 2100 vcSin 2034 vcSub 1979 vcUnpackI 2103 vcUnpackM 2103 vcUnpackV 2103 vdAdd 1976 vdlgamma 2084 vdLGamma 2084 vdPackI 2100 vdPackM 2100 vdPackV 2100 vdSin 2034 vdSub 1979 vdtgamma 2086 vdTGamma 2086 vdUnpackI 2103 vdUnpackM 2103 vdUnpackV 2103 vector arguments array dimension 2645 default 2646 examples 2645 increment 2645 length 2645 matrix one-dimensional substructures 2645 sparse vector 140 vector conjugation 1155, 1743 vector indexing 1973 vector mathematical functions absolute value 1989 addition 1976 argument 1991 complementary error function value 2073 complex exponent of real vector elements 2038 computing a rounded integer value and raising inexact result exception 2096 computing a rounded integer value in current rounding mode 2094 computing a truncated integer value 2098 conjugation 1987 cosine 2031 cube root 2004 cumulative normal distribution function value 2075 denary logarithm 2027 division 1997 error function value 2070 exponential 2019 exponential of elements decreased by 1 2022 four-quadrant arctangent 2050 
gamma function 2084, 2086 hyperbolic cosine 2052 hyperbolic sine 2055 hyperbolic tangent 2058 inverse complementary error function value 2080 inverse cosine 2042 inverse cube root 2006 inverse cumulative normal distribution function value 2082 inverse error function value 2077 inverse hyperbolic cosine 2061 inverse hyperbolic sine 2064 inverse hyperbolic tangent 2067 inverse sine 2045 inverse square root 2002 inverse tangent 2047 inversion 1995 linear fraction transformation 1993 multiplication 1983 multiplication of conjugated vector element 1986 natural logarithm 2024 natural logarithm of vector elements increased by 1 2030 power 2011 power (constant) 2014 power 2/3 2007 power 3/2 2009 rounding to nearest integer value 2093 rounding towards minus infinity 2088 rounding towards plus infinity 2089 rounding towards zero 2091 scaling 1504 scaling, reciprocal 1504 sine 2034 sine and cosine 2036 square root 2000 square root of sum of squares 2017 squaring 1981 subtraction 1979 tangent 2040 Vector Mathematical Functions vector multiplication LAPACK 1411 ScaLAPACK 1858 vector pack function 2100 vector statistics functions Bernoulli 2195 Beta 2186 Binomial 2198 Cauchy 2173 CopyStream 2138 CopyStreamState 2139 DeleteStream 2137 dNewAbstractStream 2133 Exponential 2165 Gamma 2183 Gaussian 2159 GaussianMV 2161 Geometric 2196 GetBrngProperties 2210 GetNumRegBrngs 2152 GetStreamSize 2145 GetStreamStateBrng 2151 Gumbel 2181 Hypergeometric 2200 iNewAbstractStream 2131 Laplace 2168 LeapfrogStream 2146 LoadStreamF 2141 LoadStreamM 2144 Lognormal 2178 NegBinomial 2206 Index 2747 NewStream 2128 NewStreamEx 2129 Poisson 2202 PoissonV 2204 Rayleigh 2175 RegisterBrng 2209 SaveStreamF 2140 SaveStreamM 2142 SkipAheadStream 2148 sNewAbstractStream 2135 Uniform (continuous) 2156 Uniform (discrete) 2189 UniformBits 2191 UniformBits32 2192 UniformBits64 2193 Weibull 2170 vector unpack function 2103 vector-scalar product sparse vectors 141 vectors adding magnitudes of vector elements 54 
copying 56 dot product complex vectors 61 complex vectors, conjugated 60 real vectors 58 element with the largest absolute value 71 element with the largest absolute value of real part and its index 1745 element with the smallest absolute value 72 Euclidean norm 62 Givens rotation 64 linear combination of vectors 55, 327 modified Givens transformation parameters 67 rotation of points 63 rotation of points in the modified plane 65 sparse vectors 140 sum of vectors 55, 327 swapping 70 vector-scalar product 69 viRngUniformBits 2191 viRngUniformBits32 2192 viRngUniformBits64 2193 vmcAdd 1976 vmcSin 2034 vmcSub 1979 vmdAdd 1976 vmdSin 2034 vmdSub 1979 vml Functions Interface 1971 Input Parameters 1972 Output Parameters 1973 VML 1969 VML arithmetic functions 1976 VML exponential and logarithmic functions 2019 VML functions mathematical functions v?Abs 1989 v?Acos 2042 v?Acosh 2061 v?Add 1976 v?Arg 1991 v?Asin 2045 v?Asinh 2064 v?Atan 2047 v?Atan2 2050 v?Atanh 2067 v?Cbrt 2004 v?CdfNorm 2075 v?CdfNormInv 2082 v?Ceil 2089 v?CIS 2038 v?Conj 1987 v?Cos 2031 v?Cosh 2052 v?Div 1997 v?Erf 2070 v?Erfc 2073 v?ErfcInv 2080 v?ErfInv 2077 v?Exp 2019 v?Expm1 2022 v?Floor 2088 v?Hypot 2017 v?Inv 1995 v?InvCbrt 2006 v?InvSqrt 2002 v?LGamma 2084 v?LinearFrac 1993 v?Ln 2024 v?Log10 2027 v?Log1p 2030 v?Modf 2098 v?Mul 1983 v?MulByConj 1986 v?NearbyInt 2094 v?Pow 2011 v?Pow2o3 2007 v?Pow3o2 2009 v?Powx 2014 v?Rint 2096 v?Round 2093 v?Sin 2034 v?SinCos 2036 v?Sinh 2055 v?Sqr 1981 v?Sqrt 2000 v?Sub 1979 v?Tan 2040 v?Tanh 2058 v?TGamma 2086 v?Trunc 2091 pack/unpack functions v?Pack 2100 v?Unpack 2103 service functions ClearErrorCallBack 2114 ClearErrStatus 2111 GetErrorCallBack 2114 GetErrStatus 2110 GetMode 2108 SetErrorCallBack 2111 SetErrStatus 2109 SetMode 2106 VML hyperbolic functions 2052 VML mathematical functions arithmetic 1976 exponential and logarithmic 2019 hyperbolic 2052 power and root 1995 rounding 2088 special 2070 special value notations 1976 trigonometric 2031 VML 
Mathematical Functions 1971 VML Pack Functions 1971 VML Pack/Unpack Functions 2100 VML power and root functions 1995 VML rounding functions 2088 Intel® Math Kernel Library Reference Manual 2748 VML Service Functions 2106 VML special functions 2070 VML trigonometric functions 2031 vmlClearErrorCallBack 2114 vmlClearErrStatus 2111 vmlGetErrorCallBack 2114 vmlGetErrStatus 2110 vmlGetMode 2108 vmlSetErrorCallBack 2111 vmlSetErrorStatus 2109 vmlSetMode 2106 vmsAdd 1976 vmsSin 2034 vmsSub 1979 vmzAdd 1976 vmzSin 2034 vmzSub 1979 vsAdd 1976 VSL Fortran header 2115 VSL routines advanced service routines GetBrngProperties 2210 RegisterBrng 2209 convolution/correlation CopyTask 2254 DeleteTask 2253 Exec 2239 Exec1D 2242 ExecX 2246 ExecX1D 2249 NewTask 2220 NewTask1D 2223 NewTaskX 2225 NewTaskX1D 2228 SetInternalPrecision 2234 generator routines Bernoulli 2195 Beta 2186 Binomial 2198 Cauchy 2173 Exponential 2165 Gamma 2183 Gaussian 2159 GaussianMV 2161 Geometric 2196 Gumbel 2181 Hypergeometric 2200 Laplace 2168 Lognormal 2178 NegBinomial 2206 Poisson 2202 PoissonV 2204 Rayleigh 2175 Uniform (continuous) 2156 Uniform (discrete) 2189 UniformBits 2191 UniformBits32 2192 UniformBits64 2193 Weibull 2170 service routines CopyStream 2138 CopyStreamState 2139 DeleteStream 2137 dNewAbstractStream 2133 GetNumRegBrngs 2152 GetStreamSize 2145 GetStreamStateBrng 2151 iNewAbstractStream 2131 LeapfrogStream 2146 LoadStreamF 2141 LoadStreamM 2144 NewStream 2128 NewStreamEx 2129 SaveStreamF 2140 SaveStreamM 2142 SkipAheadStream 2148 sNewAbstractStream 2135 summary statistics Compute 2302 DeleteTask 2303 EditCorParameterization 2298 EditCovCor 2280 EditMissingValues 2294 EditMoments 2278 EditOutliersDetection 2292 EditPartialCovCor 2282 EditPooledCovariance 2287 EditQuantiles 2284 EditRobustCovariance 2289 EditStreamQuantiles 2286 EditTask 2270 NewTask 2267 VSL routines:convolution/correlation SetInternalDecimation 2237 SetMode 2232 SetStart 2235 VSL Summary Statistics 2261 VSL task 2115 
vslConvCopyTask 2254 vslCorrCopyTask 2254 vsldsscompute 2302 vsldSSCompute 2302 vsldsseditcorparameterization 2298 vsldSSEditCorParameterization 2298 vsldsseditcovcor 2280 vsldSSEditCovCor 2280 vsldsseditmissingvalues 2294 vsldSSEditMissingValues 2294 vsldsseditmoments 2278 vsldSSEditMoments 2278 vsldsseditoutliersdetection 2292 vsldSSEditOutliersDetection 2292 vsldsseditpartialcovcor 2282 vsldSSEditPartialCovCor 2282 vsldsseditpooledcovariance 2287 vsldSSEditPooledCovariance 2287 vsldsseditquantiles 2284 vsldSSEditQuantiles 2284 vsldsseditrobustcovariance 2289 vsldSSEditRobustCovariance 2289 vsldsseditstreamquantiles 2286 vsldSSEditStreamQuantiles 2286 vsldssedittask 2270 vsldSSEditTask 2270 vsldssnewtask 2267 vsldSSNewTask 2267 vslgamma 2084 vsLGamma 2084 vslissedittask 2270 vsliSSEditTask 2270 vslLoadStreamF 2141 vslSaveStreamF 2140 vslssdeletetask 2303 vslSSDeleteTask 2303 vslssscompute 2302 vslsSSCompute 2302 vslssseditcorparameterization 2298 vslsSSEditCorParameterization 2298 vslssseditcovcor 2280 vslsSSEditCovCor 2280 Index 2749 vslssseditmissingvalues 2294 vslsSSEditMissingValues 2294 vslssseditmoments 2278 vslsSSEditMoments 2278 vslssseditoutliersdetection 2292 vslsSSEditOutliersDetection 2292 vslssseditpartialcovcor 2282 vslsSSEditPartialCovCor 2282 vslssseditpooledcovariance 2287 vslsSSEditPooledCovariance 2287 vslssseditquantiles 2284 vslsSSEditQuantiles 2284 vslssseditrobustcovariance 2289 vslsSSEditRobustCovariance 2289 vslssseditstreamquantiles 2286 vslsSSEditStreamQuantiles 2286 vslsssedittask 2270 vslsSSEditTask 2270 vslsssnewtask 2267 vslsSSNewTask 2267 vsPackI 2100 vsPackM 2100 vsPackV 2100 vsSin 2034 vsSub 1979 vstgamma 2086 vsTGamma 2086 vsUnpackI 2103 vsUnpackM 2103 vsUnpackV 2103 vzAdd 1976 vzPackI 2100 vzPackM 2100 vzPackV 2100 vzSin 2034 vzSub 1979 vzUnpackI 2103 vzUnpackM 2103 vzUnpackV 2103 W Weibull 2170 Wilkinson transform 1832 X xerbla 2529 xerbla_array 1532 xerbla, error reporting routine 1973 Z zbbcsd 920 zdla_gercond_c 1471 
Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Linux* OS
Document Number: 324207-005US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel(R) VTune(TM) Amplifier XE
Prerequisites
Navigation Quick Start
Key Terms and Concepts
Chapter 1: Tutorial: Finding Hotspots
  Learning Objectives
  Workflow Steps to Identify and Analyze Hotspots
  Build Application and Create New Project
  Run Hotspots Analysis
  Interpret Result Data
  Analyze Code
  Tune Algorithms
  Compare with Previous Result
  Summary
Chapter 2: Tutorial: Analyzing Locks and Waits
  Learning Objectives
  Workflow Steps to Identify Locks and Waits
  Build Application and Create New Project
  Run Locks and Waits Analysis
  Interpret Result Data
  Analyze Code
  Remove Lock
  Compare with Previous Result
  Summary
Chapter 3: Tutorial: Identifying Hardware Issues
  Learning Objectives
  Workflow Steps to Identify Hardware Issues
  Build Application and Create New Project
  Run General Exploration Analysis
  Interpret Results
  Analyze Code
  Resolve Issue
  Resolve Next Issue
  Summary
Chapter 4: More Resources
  Getting Help
  Product Website and Support
Chapter 5: Intel(R) VTune(TM) Amplifier XE Tutorials Troubleshooting
  Troubleshooting

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families.
Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries. Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright (C) 2010-2011, Intel Corporation. All rights reserved.
Introducing the Intel(R) VTune(TM) Amplifier XE

The Intel(R) VTune(TM) Amplifier XE, an Intel(R) Parallel Studio XE tool, provides information on code performance for users developing serial and multithreaded applications on Windows* and Linux* operating systems. On Windows systems, the VTune Amplifier XE integrates into Microsoft Visual Studio* software and is also available as a standalone GUI client. On Linux systems, VTune Amplifier XE works only as a standalone GUI client. On both Windows and Linux systems, you can use the command-line interface to collect data remotely or to perform regression testing.

VTune Amplifier XE helps you analyze the algorithm choices and identify where and how your application can benefit from available hardware resources. Use the VTune Amplifier XE to locate or determine the following:
• The most time-consuming (hot) functions in your application and/or on the whole system
• Sections of code that do not effectively utilize available processor time
• The best sections of code to optimize for sequential performance and for threaded performance
• Synchronization objects that affect the application performance
• Whether, where, and why your application spends time on input/output operations
• The performance impact of different synchronization methods, different numbers of threads, or different algorithms
• Thread activity and transitions
• Hardware-related bottlenecks in your code

Intel VTune Amplifier XE Tutorials

These tutorials tell you how to use the VTune Amplifier XE to analyze the performance of a sample application by identifying software- and hardware-related issues in the code.
• Finding Hotspots
• Analyzing Locks and Waits
• Identifying Hardware Issues

Check http://software.intel.com/en-us/articles/intel-software-product-tutorials/ for the printable version (PDF) of product tutorials.
See Also
Getting Help

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

Required Tools

You need the following tools to use these tutorials:
• Intel(R) VTune(TM) Amplifier XE
• Sample code included with the VTune Amplifier XE. VTune Amplifier XE provides the following sample applications:
  • tachyon application used for the Finding Hotspots and Analyzing Locks and Waits tutorials
  • matrix application used for the Identifying Hardware Issues tutorial
• VTune Amplifier XE Help

To acquire the VTune Amplifier XE:
If you do not already have access to the VTune Amplifier XE, you can download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/. To install the VTune Amplifier XE, follow the instructions in the Release Notes.

To install and set up VTune Amplifier XE sample code:
1. Copy the tachyon_vtune_amp_xe.tar.gz and matrix_vtune_amp_xe.tar.gz files from the samples/ folder in the Intel VTune Amplifier XE installation directory to a writable directory or share on your system. The default installation directory is /opt/intel/vtune_amplifier_xe_2011.
2. Extract the sample(s) from the .tar.gz archive(s).

NOTE
• Samples are non-deterministic. Your screens may vary from the screen shots shown throughout these tutorials.
• Samples are designed only to illustrate VTune Amplifier XE features and do not represent best practices for tuning the code. Results may vary depending on the nature of the analysis.

To run the VTune Amplifier XE:
Launch the amplxe-gui script from the /opt/intel/vtune_amplifier_xe_2011/bin32 directory.

To access VTune Amplifier XE Help:
See the Getting Help topic.
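The extraction step in the sample setup above can also be scripted. The following is a minimal sketch in Python rather than part of the product: the helper name extract_sample and the destination directory are hypothetical, and only the archive file names come from this tutorial. It relies solely on the standard tarfile module.

```python
import tarfile
from pathlib import Path


def extract_sample(archive: str, dest_dir: str) -> list[str]:
    """Extract a .tar.gz sample archive into dest_dir and return its member names.

    Hypothetical helper for illustration; not shipped with VTune Amplifier XE.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # the destination must be writable
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        # Newer Python releases provide extraction filters that reject unsafe
        # members (for example, absolute paths); use one when it is available.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(path=dest, filter="data")
        else:
            tar.extractall(path=dest)
    return names
```

For example, after copying the archive to a writable directory, extract_sample("tachyon_vtune_amp_xe.tar.gz", "/home/intel/samples") would unpack it under /home/intel/samples, the example location used later in this tutorial.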
Navigation Quick Start

Standalone Intel(R) VTune(TM) Amplifier XE

Use the VTune Amplifier XE menu to control result collection, define and view project properties, and set various options.

Use the VTune Amplifier XE toolbar to configure and control result collection.

Use the Project Navigator to manage your VTune Amplifier XE projects and collected analysis results. Click the Project Navigator button on the toolbar to enable/disable the Project Navigator.

Use the VTune Amplifier XE result tabs to manage result data. You can view or change the result file location from the Project Properties dialog box.

Use the drop-down menu to select a viewpoint, a preset configuration of windows/panes for an analysis result. For each analysis type, you can switch among several preset configurations to focus on particular performance metrics. Click the yellow question mark icon to read the viewpoint description.

Switch between window tabs to explore the analysis type configuration options and collected data provided by the selected viewpoint.

Use the Grouping drop-down menu to choose a granularity level for grouping data in the grid.

Use the filter toolbar to filter out the result data according to the selected categories.

Key Terms and Concepts

Key Terms

baseline: A performance metric used as a basis for comparison of the application versions before and after optimization. The baseline should be measurable and reproducible.

CPU time: The amount of time a thread spends executing on a logical processor. For multiple threads, the CPU time of the threads is summed. The application CPU time is the sum of the CPU time of all the threads that run the application.

Elapsed time: The total time your target ran, calculated as follows: Wall clock time at end of application – Wall clock time at start of application.
hotspot: A section of code that took a long time to execute. Some hotspots may indicate bottlenecks and can be removed, while other hotspots inevitably take a long time to execute due to their nature.

target: A target is an executable file you analyze using the Intel(R) VTune(TM) Amplifier XE.

viewpoint: A preset result tab configuration that filters out the data collected during a performance analysis and enables you to focus on specific performance problems. When you select a viewpoint, you select a set of performance metrics the VTune Amplifier XE shows in the windows/panes of the result tab. To select the required viewpoint, click the button and use the drop-down menu at the top of the result tab.

Wait time: The amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.

Key Concept: CPU Usage

For the user-mode sampling and tracing analysis types, the Intel(R) VTune(TM) Amplifier XE identifies a processor utilization scale, calculates the target CPU usage, and defines default utilization ranges depending on the number of processor cores. You can change the utilization ranges by dragging the slider in the CPU Usage histogram in the Summary window.

Utilization Type: Description
Idle: All CPUs are waiting - no threads are running.
Poor: Poor usage. By default, poor usage is when the number of simultaneously running CPUs is less than or equal to 50% of the target CPU usage.
OK: Acceptable (OK) usage. By default, OK usage is when the number of simultaneously running CPUs is between 51-85% of the target CPU usage.
Ideal: Ideal usage. By default, Ideal usage is when the number of simultaneously running CPUs is between 86-100% of the target CPU usage.

Key Concept: Data of Interest

The VTune Amplifier XE maintains a special column called Data of Interest. This column is highlighted with a yellow background and a yellow star in the column header.
The data in the Data of Interest column is used by various windows as follows:
• The Call Stack pane calculates the contribution, shown in the contribution bar, using the Data of Interest column values.
• The Filter bar uses the data of interest values to calculate the percentage indicated in the filtered option.
• The Source/Assembly window uses this column for hotspot navigation.

If a viewpoint has more than one column with numeric data or bars, you can change the default Data of Interest column by right-clicking the required column and selecting the Set Column as Data of Interest command from the pop-up menu.

Key Concept: Event-based Metrics

When analyzing data collected during a hardware event-based sampling analysis, the VTune Amplifier XE uses performance metrics. Each metric is an event ratio with its own threshold values. As soon as the performance of a program unit per metric exceeds the threshold, the VTune Amplifier XE marks this value as a performance issue (in pink) and provides recommendations on how to fix it. Each column in the Bottom-up pane provides data per metric. To read the metric description and see the formula used for the metric calculation, mouse over the metric column header. To read the description of the hardware issue and see the threshold formula used for this issue, mouse over the link cell in the grid.

For the full list of metrics used by the VTune Amplifier XE, see the Hardware Event-based Metrics topic in the online help.

Key Concept: Event-based Sampling Analysis

VTune Amplifier XE introduces a set of advanced hardware analysis types based on the event-based sampling data collection and targeted for the Intel(R) Core(TM) 2 processor family, processors based on the Intel(R) microarchitecture code name Nehalem, and Intel(R) microarchitecture code name Sandy Bridge.
Depending on the analysis type, the VTune Amplifier XE monitors a set of hardware events and presents the collected data as hardware event-based metrics defined by Intel architects (for example, Clockticks per Instructions Retired, Contested Accesses, and so on). Typically, it is recommended that you start with the General Exploration analysis type, which collects the maximum number of events and provides the widest picture of the hardware issues that affected the performance of your application.

For more information on the event-based sampling analysis, see the Hardware Event-based Sampling Collection topic in the online help.

Key Concept: Event Skid

Event skid is the recording of an event not exactly on the code line that caused the event. Event skids may even result in a caller function event being recorded in the callee function. Event skid is caused by a number of factors:
• The delay in propagating the event out of the processor's microcode through the interrupt controller (APIC) and back into the processor.
• The current instruction retirement cycle must be completed.
• When the interrupt is received, the processor must serialize its instruction stream, which causes a flushing of the execution pipeline.

The Intel(R) processors support accurate event location for some events. These events are called precise events. See the online help for more details.

Key Concept: Finalization

Finalization is the process of the Intel(R) VTune(TM) Amplifier XE converting the collected data to a database, resolving symbol information, and pre-computing data to make further analysis more efficient and responsive. The VTune Amplifier XE finalizes data automatically when data collection completes.
You may want to re-finalize a result to:
• update symbol information after changes in the search directories settings
• resolve [Unknown] entries in the results

Key Concept: Hotspots Analysis

The Hotspots analysis helps you understand the application flow and identify sections of code that took a long time to execute (hotspots). A large number of samples collected at a specific process, thread, or module can imply high processor utilization and potential performance bottlenecks. Some hotspots can be removed, while other hotspots are fundamental to the application functionality and cannot be removed.

The Intel(R) VTune(TM) Amplifier XE creates a list of functions in your application ordered by the amount of time spent in a function. It also detects the call stacks for each of these functions so you can see how the hot functions are called. The VTune Amplifier XE uses a low overhead (about 5%) user-mode sampling and tracing collection that gets you the information you need without slowing down the application execution significantly.

Key Concept: Locks and Waits Analysis

While the Concurrency analysis helps identify where your application is not parallel, the Locks and Waits analysis helps identify the cause of the ineffective processor utilization. One of the most common problems is threads waiting too long on synchronization objects (locks). Performance suffers when waits occur while cores are under-utilized. During the Locks and Waits analysis you can estimate the impact each synchronization object introduces to the application and understand how long the application was required to wait on each synchronization object, or in blocking APIs, such as sleep and blocking I/O.

Key Concept: Thread Concurrency

The number of active threads corresponds to the concurrency level of an application.
By comparing the concurrency level with the number of processors, Intel(R) VTune(TM) Amplifier XE classifies how an application utilizes the processors in the system. It defines default utilization ranges depending on the number of processor cores and displays the thread concurrency in the Summary and Bottom-up windows. You can change the utilization ranges by dragging the slider in the Summary window.

Thread concurrency may be higher than CPU Usage if threads are in the runnable state and not consuming CPU time. VTune Amplifier XE defines the Target Concurrency level for your application that is, by default, equal to the number of physical cores.

Utilization Type: Description
Idle: All threads in the application are waiting - no threads are running. There can be only one bar in the Thread Concurrency histogram indicating Idle utilization.
Poor: Poor utilization. By default, poor utilization is when the number of threads is up to 50% of the target concurrency.
OK: Acceptable (OK) utilization. By default, OK utilization is when the number of threads is between 51-85% of the target concurrency.
Ideal: Ideal utilization. By default, ideal utilization is when the number of threads is between 86-115% of the target concurrency.
Over: Over-utilization. By default, over-utilization is when the number of threads is more than 115% of the target concurrency.

Chapter 1: Tutorial: Finding Hotspots

Learning Objectives

This tutorial shows how to use the Hotspots analysis of the Intel(R) VTune(TM) Amplifier XE to understand where the sample application is spending time, identify hotspots (the most time-consuming program units), and detect how they were called. Some hotspots may indicate bottlenecks that can be removed, while other hotspots are inevitable and take a long time to execute due to their nature.
Typically, the hotspot functions identified during the Hotspots analysis use the most time-consuming algorithms and are good candidates for parallelization. The Hotspots analysis is useful to analyze the performance of both serial and parallel applications.

Estimated completion time: 15 minutes.
Sample application: tachyon.

After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Choose the Hotspots analysis type.
• Run the Hotspots analysis to locate the most time-consuming functions in an application.
• Analyze the function call flow and threads.
• Analyze the source code to locate the most time-critical code lines.
• Compare results before and after optimization.

Start Here
Workflow Steps to Identify and Analyze Hotspots

Workflow Steps to Identify and Analyze Hotspots

You can use the Intel(R) VTune(TM) Amplifier XE to identify and analyze hotspot functions in your serial or parallel application by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon.
1. Build an application to analyze for hotspots and create a new VTune Amplifier XE project.
2. Choose and run the Hotspots analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical function.
5. Modify the code to tune the algorithms.
6. Re-build the target, re-run the Hotspots analysis, and compare the result data before and after optimization.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Build Application and Create New Project

Before you start analyzing your application target for hotspots, do the following:
1. Build the application in release mode with full optimizations.
2. Run the application without debugging to create a performance baseline.
3. Create a VTune Amplifier XE project.

Choose a Build Mode and Build a Target

1. Browse to the directory where you extracted the sample code (for example, /home/intel/samples/tachyon_vtune_amp_xe). Make sure this directory contains Makefile.
2. Clean up all the previous builds as follows:
   $ make clean
3. Build your target in the release mode as follows:
   $ make release
   The tachyon_find_hotspots application is built.

Create a Performance Baseline

1. Run tachyon_find_hotspots with dat/balls.dat as an input parameter. For example:
   $ /home/intel/samples/tachyon_vtune_amp_xe/tachyon_find_hotspots dat/balls.dat
   The tachyon_find_hotspots application starts running.
   NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.
2. Note the execution time displayed in the window caption or in the shell window. For the tachyon_find_hotspots executable in the figure above, the execution time is 83.539 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.
   NOTE Run the application several times, note the execution time for each run, and use the average number. This helps to minimize skewed results due to transient system activity.

Create a Project

1. Set the EDITOR or VISUAL environment variable to associate your source files with the code editor (like emacs, vi, vim, gedit, and so on). For example:
   $ export EDITOR=gedit
2. From the bin32 directory (for IA-32 architecture) or the bin64 directory (for Intel(R) 64 architecture) under the installation directory, run the amplxe-gui script to launch the VTune Amplifier XE GUI. By default, the installation directory is /opt/intel/vtune_amplifier_xe_2011.
3. Create a new project via File > New > Project.... The Create a Project dialog box opens.
4. Specify the project name tachyon that will be used as the project directory name. The VTune Amplifier XE creates the tachyon project directory under the root/intel/My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
5. In the Application to Launch pane of the Target tab, specify and configure your target as follows:
   • For the Application field, browse to the tachyon_find_hotspots executable in your sample code directory, for example: /home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_find_hotspots.
   • For the Application parameters field, enter dat/balls.dat.
6. Click OK to apply the settings and exit the Project Properties dialog box.

Recap

You built the target in the Release mode, created the performance baseline, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts
• Term: target, baseline

Next Step
Run Hotspots Analysis

Run Hotspots Analysis

Before running an analysis, choose a configuration level to influence Intel(R) VTune(TM) Amplifier XE analysis scope and running time. In this tutorial, you run the Hotspots analysis to identify the hotspots that took a long time to execute.

To run an analysis:
1. From the VTune Amplifier XE toolbar, click the New Analysis button. The VTune Amplifier XE result tab opens with the Analysis Type window active.
2. On the left pane of the Analysis Type window, locate the analysis tree and select Algorithm Analysis > Hotspots. The right pane is updated with the default options for the Hotspots analysis.
3. Click the Start button on the right command bar. VTune Amplifier XE launches the tachyon_find_hotspots application, which takes balls.dat as an input file, calculates the execution time, and exits. VTune Amplifier XE finalizes the collected results and opens the Hotspots viewpoint.

NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap

You launched the Hotspots data collection that analyzes function calls and CPU time spent in each program unit of your application.

NOTE This tutorial explains how to run an analysis from the VTune Amplifier XE graphical user interface (GUI). You can also use the VTune Amplifier XE command-line interface (amplxe-cl command) to run an analysis. For more details, check the Command-line Interface Support section of the VTune Amplifier XE Help.

Key Terms and Concepts
• Term: hotspot, Elapsed time, viewpoint
• Concept: Hotspots Analysis, Finalization

Next Step
Interpret Result Data

Interpret Result Data

When the sample application exits, the Intel(R) VTune(TM) Amplifier XE finalizes the results and opens the Hotspots viewpoint that consists of the Summary, Bottom-up, and Top-down Tree windows. To interpret the data on the sample code performance, do the following:
• Understand the basic performance metrics provided by the Hotspots analysis.
• Analyze the most time-consuming functions.
• Analyze CPU usage per function.

NOTE The screenshots and execution time data provided in this tutorial are created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Basic Hotspots Metrics

Start analysis with the Summary window.
To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.

Note that CPU Time for the sample application is equal to 89.876 seconds. It is the sum of CPU time for all application threads. Total Thread Count is 1, so the sample application is single-threaded.

The Top Hotspots section provides data on the most time-consuming functions (hotspot functions) sorted by CPU time spent on their execution. For the sample application, the initialize_2D_buffer function, which took 52.939 seconds to execute, shows up at the top of the list as the hottest function. The [Others] entry at the bottom shows the sum of CPU time for all functions not listed in the table.

Analyze the Most Time-consuming Functions
Click the Bottom-up tab to explore the Bottom-up pane. By default, the data in the grid is sorted by Function. You may change the grouping level using the Grouping drop-down menu at the top of the grid.

Analyze the CPU Time column values. This column is marked with a yellow star as the Data of Interest column. It means that the VTune Amplifier XE uses this type of data for some calculations (for example, filtering, stack contribution, and others). Functions that took the most CPU time to execute are listed on top. The initialize_2D_buffer function took 52.939 seconds to execute. Click the arrow sign at the initialize_2D_buffer function to expand the stacks calling this function. You see that it was called only by the setup_2D_buffer function.

Select the initialize_2D_buffer function in the grid and explore the data provided in the Call Stack pane on the right. The Call Stack pane displays full stack data for each hotspot function and enables you to navigate between function call stacks and understand the impact of each stack on the function CPU time.
The stack functions in the Call Stack pane are represented in the following format: ! - :, where the line number corresponds to the line calling the next function in the stack. For the sample application, the hottest function is called at line 87 of the setup_2D_buffer function in the global.cpp file.

Analyze CPU Usage per Function
VTune Amplifier XE enables you to analyze the collected data from different perspectives by using multiple viewpoints. For the Hotspots analysis result, you may switch to the Hotspots by CPU Usage viewpoint to understand how your hotspot function performs in terms of the CPU usage. Explore this viewpoint to determine how your application utilized available cores and identify the most serial code.

If you go back to the Summary window, you can see the CPU Usage Histogram that represents the Elapsed time and usage level for the available logical processors. The tachyon_find_hotspots application ran mostly on one logical CPU. If you hover over the highest bar, you see that it spent 79.695 seconds using one core only, which is classified by the VTune Amplifier XE as Poor utilization for a dual-core system. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.

To get the detailed CPU usage information per function, use the button in the Bottom-up window to expand the CPU Time column. Note that initialize_2D_buffer is the function with the longest poor CPU utilization (red bars). This means that the processor cores were underutilized most of the time spent on executing this function.

If you change the grouping level (highlighted in the figure above) in the Bottom-up pane from Function/Call Stack to Thread/Function/Call Stack, you see that the initialize_2D_buffer function belongs to the thread_video thread. This thread is also identified as a hotspot and shows up at the top in the Bottom-up pane.
To get detailed information on the hotspot thread performance, explore the Timeline pane.
• Timeline area. When you hover over the graph element, the timeline tooltip displays the time passed since the application has been launched.
• Threads area that shows the distribution of CPU time utilization per thread. Hover over a bar to see the CPU time utilization in percent for this thread at each moment of time. Green zones show the time threads are active.
• CPU Usage area that shows the distribution of CPU time utilization for the whole application. Hover over a bar to see the application-level CPU time utilization in percent at each moment of time.

VTune Amplifier XE calculates the overall CPU Usage metric as the sum of CPU time per each thread of the Threads area. Maximum CPU Usage value is equal to [number of processor cores] x 100%.

The Timeline analysis also identifies the thread_video thread as the most active. The tooltip shows that CPU time values are about 100% whereas the maximum CPU time value for dual-core systems is 200%. This means that the processor cores were half-utilized for most of the time spent on executing the tachyon_find_hotspots application.

Recap
You identified a function that took the most CPU time and could be a good candidate for algorithm tuning.

Key Terms and Concepts
• Term: Elapsed time, CPU time, viewpoint
• Concept: Hotspots Analysis, CPU Usage, Data of Interest

Next Step
Analyze Code

Analyze Code
You identified initialize_2D_buffer as the hottest function. In the Bottom-up pane, double-click this function to open the Source window and analyze the source code:
• Understand basic options provided in the Source window.
• Identify the hottest code lines.

Understand Basic Source Window Options
The table below explains some of the features available in the Source window when viewing the Hotspots analysis data.
• Source pane displaying the source code of the application if the function symbol information is available. The code line that took the most CPU time to execute is highlighted. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected hotspot function. To enable the Source pane, make sure to build the target properly.
• Assembly pane displaying the assembler instructions for the selected hotspot function. Assembler instructions are grouped by basic blocks. The assembler instructions for the selected hotspot function are highlighted. To get help on an assembler instruction, right-click the instruction and select Instruction Reference.
NOTE To get the help on a particular instruction, make sure to have the Adobe* Acrobat Reader* 9 (or later) installed. If an earlier version of the Adobe Acrobat Reader is installed, the Instruction Reference opens but you need to locate the help on each instruction manually.
• Processor time attributed to a particular code line. If the hotspot is a system function, its time, by default, is attributed to the user function that called this system function.
• Source window toolbar. Use the hotspot navigation buttons to switch between the most performance-critical code lines. Hotspot navigation is based on the metric column selected as a Data of Interest. For the Hotspots analysis, this is CPU Time. Use the Source/Assembly buttons to toggle the Source/Assembly panes (if both of them are available) on/off.
• Heat map markers to quickly identify performance-critical code lines (hotspots). The bright blue markers indicate hot lines for the function you selected for analysis. Light blue markers indicate hot lines for other functions. Scroll to a marker to locate the hot code line it identifies.
Identify the Hottest Code Lines
When you identify a hotspot in the serial code, you can make some changes in the code to tune the algorithms and speed up that hotspot. Another option is to parallelize the sample code by adding threads to the application so that it performs well on multi-core processors. This tutorial focuses on algorithm tuning.

By default, when you double-click the hotspot in the Bottom-up pane, VTune Amplifier XE opens the source file related to this function. For the initialize_2D_buffer function, the hottest code line is 121. This code is used to initialize a memory array using non-sequential memory locations. Click the Source Editor button on the Source window toolbar to open the default code editor and work on optimizing the code.

Recap
You identified the code section that took the most CPU time to execute.

Key Terms and Concepts
• Term: hotspot, CPU time
• Concept: Hotspots Analysis, Data of Interest

Next Step
Tune Algorithms

Tune Algorithms
In the Source window, you identified that in the initialize_2D_buffer hotspot function the code line 121 took the most CPU time. Focus on this line and do the following:
1. Open the code editor.
2. Optimize the algorithm used in this code section.

Open the Code Editor
In the Source window, click the Source Editor button to open the initbuffer.cpp file in the default code editor.

The hotspot line is used to initialize a memory array using non-sequential memory locations. For demonstration purposes, these code lines are marked in comments as a slower method of filling the array.

Resolve the Problem
To resolve this issue, optimize your algorithm as follows:
1. Edit lines 110 and 113 to comment out code lines 111-125 marked as a "First (slower) method".
2. Edit line 144 to uncomment code lines 145-151 marked as a "Faster method".
In this step, you interchange the for loops so that the array is initialized in sequential memory locations.
3. Save the changes made in the source file.
4. Browse to the directory where you extracted the sample code (for example, /home/intel/samples/en/tachyon_vtune_amp_xe).
5. Rebuild your target in the release mode using the make command as follows:
$ make clean
$ make release
The tachyon_find_hotspots application is rebuilt and stored in the tachyon_vtune_amp_xe directory.
6. Run tachyon_find_hotspots as follows:
/home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_find_hotspots dat/balls.dat
The system runs the tachyon_find_hotspots application. Note that the execution time was reduced from 83.539 seconds to 43.760 seconds.

Recap
You interchanged the loops in the hotspot function, rebuilt the application, and got a performance gain of about 40 seconds.

Key Terms and Concepts
• Term: hotspot

Next Step
Compare with Previous Result

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Compare with Previous Result
You optimized your code by applying a loop interchange that gave you about 40 seconds of improvement in the application execution time.
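The loop interchange applied in the Tune Algorithms step can be illustrated with a minimal C++ sketch. This is not the tachyon source: the function names, the flat row-major buffer, and the fill value are assumptions made for illustration only. Both functions produce the same buffer contents; they differ only in the order in which memory is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the tutorial's buffer initialization;
// names and layout are assumptions, not taken from initbuffer.cpp.

// Slower pattern: the outer loop walks columns, so consecutive writes
// into the row-major buffer are `cols` elements apart (non-sequential).
void fill_column_first(std::vector<unsigned>& buf,
                       std::size_t rows, std::size_t cols) {
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            buf[r * cols + c] = static_cast<unsigned>(r * cols + c);
        }
    }
}

// Interchanged loops: the outer loop walks rows, so consecutive writes
// land in adjacent memory locations (sequential access).
void fill_row_first(std::vector<unsigned>& buf,
                    std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            buf[r * cols + c] = static_cast<unsigned>(r * cols + c);
        }
    }
}
```

Because both versions compute identical results, the interchange is a pure access-order change. How much time it saves depends on the buffer size and the cache hierarchy of the system, so the gain on a given machine should be measured rather than assumed.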
To understand whether you got rid of the hotspot and what kind of optimization you got per function, re-run the Hotspots analysis on the optimized code and compare results:
• Compare results before and after optimization.
• Identify the performance gain.

Compare Results Before and After Optimization
1. Run the Hotspots analysis on the modified code.
2. Click the Compare Results button on the Intel(R) VTune(TM) Amplifier XE toolbar. The Compare Results window opens.
3. Specify the Hotspots analysis results you want to compare and click the Compare Results button. The Hotspots Bottom-up window opens, showing the CPU time usage across the two results and the differences side by side.
• Difference in CPU time between the two results in the following format: = – .
• CPU time for the initial version of the tachyon_find_hotspots application.
• CPU time for the optimized version of the tachyon_find_hotspots application.

Identify the Performance Gain
Explore the Bottom-up pane to compare CPU time data for the first hotspot: CPU Time:r001hs - CPU Time:r002hs = CPU Time: Difference. 52.939s - 11.971s = 40.968s, which means that you got the optimization of ~41 seconds for the initialize_2D_buffer function.

If you switch to the Summary window, you see that the Elapsed time also shows 3.6 seconds of optimization for the whole application execution.

Recap
You ran the Hotspots analysis on the optimized code and compared the results before and after optimization using the Compare mode of the VTune Amplifier XE. Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier XE command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier XE online help.
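The Difference column described above is a per-function subtraction: CPU time in the initial result minus CPU time in the optimized result. The following C++ sketch reproduces that arithmetic outside the tool. The `Result` alias and the `difference` function are hypothetical names, and treating a function missing from one result as 0 seconds is an assumption of this sketch, not documented VTune Amplifier XE behavior.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical representation of one analysis result:
// function name -> CPU time in seconds.
using Result = std::map<std::string, double>;

// Per-function difference: initial minus optimized.
// Assumption: a function absent from one result counts as 0 seconds.
Result difference(const Result& initial, const Result& optimized) {
    Result diff;
    for (Result::const_iterator it = initial.begin(); it != initial.end(); ++it) {
        diff[it->first] = it->second;
    }
    for (Result::const_iterator it = optimized.begin(); it != optimized.end(); ++it) {
        diff[it->first] -= it->second;
    }
    return diff;
}
```

With the values quoted in the text, an initial result of 52.939 seconds and an optimized result of 11.971 seconds for initialize_2D_buffer give a difference of 40.968 seconds. A positive difference means the optimized version spent less CPU time in that function; a negative one would indicate a regression.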
Key Terms and Concepts
• Term: hotspot, CPU time
• Concept: Hotspots Analysis

Next Step
Read Summary

Summary
You have completed the Finding Hotspots tutorial. Here are some important things to remember when using the Intel(R) VTune(TM) Amplifier XE to analyze your code for hotspots:

Step 1. Choose and Build Your Target
• Create a performance baseline to compare the application versions before and after optimization. Make sure to use the same workload for each application run.
• Create a VTune Amplifier XE project and use the Project Properties: Target tab to choose and configure your analysis target.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis. For example, you may limit the data collection to a predefined amount of data or enable the VTune Amplifier XE to collect more accurate CPU time data. You can also run the analysis from the command line using the amplxe-cl command.

Step 3. Interpret Results and Resolve the Issue
• Start analyzing the performance of your application from the Summary window to explore the performance metrics for the whole application. Then, move to the Bottom-up window to analyze the performance per function. Focus on the hotspots, the functions that took the most CPU time. By default, they are located at the top of the table.
• Double-click the hotspot function in the Bottom-up pane or Call Stack pane to open its source code at the code line that took the most CPU time.

Step 4. Compare Results Before and After Optimization
• Perform regular regression testing by comparing analysis results before and after optimization. From the GUI, click the Compare Results button on the VTune Amplifier XE toolbar. From the command line, use the amplxe-cl command.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Tutorial: Analyzing Locks and Waits

Learning Objectives
This tutorial shows how to use the Locks and Waits analysis of the Intel(R) VTune(TM) Amplifier XE to identify one of the most common reasons for an inefficient parallel application: threads waiting too long on synchronization objects (locks) while processor cores are underutilized. Focus your tuning efforts on objects with long waits where the system is underutilized.

Estimated completion time: 15 minutes.
Sample application: tachyon.

After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Choose the Locks and Waits analysis type.
• Run the Locks and Waits analysis.
• Identify the synchronization objects with long waits and poor CPU utilization.
• Analyze the source code to locate the most critical code lines.
• Compare results before and after optimization.

Start Here
Workflow Steps to Identify Locks and Waits

Workflow Steps to Identify Locks and Waits
You can use the Intel(R) VTune(TM) Amplifier XE to understand the cause of the ineffective processor utilization by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon.
1.
Build an application to analyze for locks and waits and create a new VTune Amplifier XE project.
2. Run the Locks and Waits analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical function.
5. Modify the code to remove the lock.
6. Re-build the target, re-run the Locks and Waits analysis, and compare the result data before and after optimization.

Build Application and Create New Project
Before you start analyzing your application for locks and waits, do the following:
1. Build the application in the release mode with full optimizations.
2. Run the application without debugging to create a performance baseline.
3. Create a VTune Amplifier XE project.

Choose a Build Mode and Build a Target
1. Browse to the directory where you extracted the sample code (for example, /home/intel/samples/en/tachyon_vtune_amp_xe). Make sure this directory contains Makefile.
2. Clean up all the previous builds using the following command:
$ make clean
3. Build your target in the release mode using the following command:
$ make release
The tachyon_analyze_locks application is built and stored in the tachyon_vtune_amp_xe directory.

Create a Performance Baseline
1. Run tachyon_analyze_locks with dat/balls.dat as an input parameter. For example:
/home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_analyze_locks dat/balls.dat
The tachyon_analyze_locks application runs in multiple sections (depending on the number of CPUs in your system).
NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.
2. Note the execution time displayed in the window caption and in the shell window. For the tachyon_analyze_locks executable in the figure above, the execution time is 29.647 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.
NOTE Run the application several times, note the execution time for each run, and use the average number. This helps to minimize skewed results due to transient system activity.

Create a Project
1. Set the EDITOR or VISUAL environment variable to associate your source files with a code editor (such as emacs, vi, vim, or gedit). For example:
$ export EDITOR=gedit
2. From the /bin32 directory (for IA-32 architecture) or from the /bin64 directory (for Intel(R) 64 architecture), run the amplxe-gui script to launch the VTune Amplifier XE. By default, the installation directory is /opt/intel/vtune_amplifier_xe_2011.
3. Create a new project via File > New > Project.... The Create a Project dialog box opens.
4. Specify the project name tachyon that will be used as the project directory name. VTune Amplifier XE creates a project directory under the root/intel/My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
5. In the Application to Launch pane of the Target tab, specify and configure your target as follows:
• For the Application field, browse to: /tachyon_analyze_locks (for example, /home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_analyze_locks).
• For the Application parameters field, specify dat/balls.dat.
6. Click OK to apply the settings and exit the Project Properties dialog box.

Recap
You built the target in the Release mode, created the performance baseline, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts
• Term: target, baseline
• Concept: Locks and Waits Analysis

Next Step
Run Locks and Waits Analysis

Run Locks and Waits Analysis
Before running an analysis, choose a configuration level to define the Intel(R) VTune(TM) Amplifier XE analysis scope and running time. In this tutorial, you run the Locks and Waits analysis to identify synchronization objects that caused contention and fix the problem in the source.
To run an analysis:
1. From the VTune Amplifier XE toolbar, click the New Analysis button. The VTune Amplifier XE result tab opens with the Analysis Type window active.
2. From the analysis tree on the left, select Algorithm Analysis > Locks and Waits. The right pane is updated with the default options for the Locks and Waits analysis.
3. Click the Start button on the right command bar. The VTune Amplifier XE launches the tachyon_analyze_locks executable that renders balls.dat as an input file, calculates the execution time, and exits. The VTune Amplifier XE finalizes the collected data and opens the results in the Locks and Waits viewpoint.

NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap
You ran the Locks and Waits data collection that analyzes how long the application had to wait on each synchronization object, or on blocking APIs, such as sleep() and blocking I/O, and estimates processor utilization during the wait.

NOTE This tutorial explains how to run an analysis from the VTune Amplifier XE graphical user interface (GUI). You can also use the VTune Amplifier XE command-line interface (amplxe-cl command) to run an analysis. For more details, check the Command-line Interface Support section of the VTune Amplifier XE Help.

Key Terms and Concepts
• Term: viewpoint
• Concept: Locks and Waits Analysis, Finalization

Next Step
Interpret Result Data

Interpret Result Data
When the sample application exits, the Intel(R) VTune(TM) Amplifier XE finalizes the results and opens the Locks and Waits viewpoint that consists of the Summary window, Bottom-up pane, Top-down Tree pane, Call Stack pane, and Timeline pane.
To interpret the data on the sample code performance, do the following:
• Analyze the basic performance metrics provided by the Locks and Waits analysis.
• Identify locks.

NOTE The screenshots and execution time data provided in this tutorial are created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Analyze the Basic Locks and Waits Metrics
Start with exploring the data provided in the Summary window for the whole application performance. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.

The Result Summary section provides data on the overall application performance per the following metrics:
1) Elapsed Time is the total time for each core when it was either waiting or not utilized by the application;
2) Total Thread Count is the number of threads in the application;
3) Wait Time is the amount of time the application threads waited for some event to occur, such as synchronization waits and I/O waits;
4) Wait Count is the overall number of times the system wait API was called for the analyzed application;
5) CPU Time is the sum of CPU time for all threads;
6) Spin Time is the time a thread is active in a synchronization construct.

For the tachyon_analyze_locks application, the Wait time is high. To identify the cause, you need to understand how this Wait time was distributed per synchronization objects.

The Top Waiting Objects section provides the list of five synchronization objects with the highest Wait Time and Wait Count, sorted by the Wait Time metric. For the tachyon_analyze_locks application, focus on the first three objects and explore the Bottom-up pane data for more details.

The Thread Concurrency Histogram represents the Elapsed time and concurrency level for the specified number of running threads.
Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range. Note the Target value. By default, this number is equal to the number of physical cores. Consider this number as your optimization goal. The Average metric is calculated as CPU time / Elapsed time. Use this number as a baseline for your performance measurements. The closer this number is to the number of cores, the better.

For the sample code, the chart shows that tachyon_analyze_locks is a multithreaded application running two threads on a machine with four cores, but it is not using the available cores effectively. The Average CPU Usage on the chart is about 0.8, while your target should be to bring it as close to 4 as possible (for the system with four cores). Hover over the second bar to understand how long the application ran serially. The tooltip shows that the application ran one thread for almost 29 seconds, which is classified as Poor concurrency.

The CPU Usage Histogram represents the Elapsed time and usage level for the logical CPUs. Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range. The tachyon_analyze_locks application ran mostly on one logical CPU. If you hover over the second bar, you see that it spent 24.897 seconds using one core only, which is classified by the VTune Amplifier XE as Poor utilization. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.

Identify Locks
Click the Bottom-up tab to open the Bottom-up pane.
• Synchronization objects that control threads in the application. The hash (unique number) appended to some names of the objects identifies the stack creating this synchronization object. For Intel(R) Threading Building Blocks (Intel(R) TBB), VTune Amplifier XE is able to recognize all types of Intel TBB objects.
To display an overhead introduced by Intel TBB library internals, the VTune Amplifier XE creates a pseudo synchronization object TBB scheduler that includes all waits from the Intel TBB runtime libraries.
• The utilization of the processor time when a given thread waited for some event to occur. By default, the synchronization objects are sorted by Poor processor utilization type. Bars showing OK or Ideal utilization (orange and green) are utilizing the processors well. You should focus your optimization efforts on functions with the longest poor CPU utilization (red bars if the bar format is selected). Next, search for the longest over-utilized time (blue bars). This is the Data of Interest column for the Locks and Waits analysis results that is used for different types of calculations, for example: call stack contribution, percentage value on the filter toolbar.
• Number of times the corresponding system wait API was called. For a lock, it is the number of times the lock was contended and caused a wait. It is usually recommended to focus your tuning efforts on the waits with both high Wait Time and Wait Count values, especially if they have poor utilization.
• Wait time, during which the CPU is busy. This often occurs when a synchronization API causes the CPU to poll while the software thread is waiting. Some Spin time may be preferable to the alternative of increased thread context switches. However, too much Spin time can reflect lost opportunity for productive work.

For the analyzed sample code, you see that the top three synchronization objects caused the longest Wait time. The red bars in the Wait Time column indicate that most of the time for these objects processor cores were underutilized.

Consider the first item in the Bottom-up pane, which is the most interesting one. It is a Mutex that shows much serial time and is causing a wait.
Click the arrow sign at the object name to expand the node and see the draw_task wait function that contains this mutex and call stack. Double-click the Mutex to see the source code for the wait function.

Recap
You identified a synchronization object with the high Wait Time and Wait Count values and poor CPU utilization that could be a lock affecting application parallelism. Your next step is to analyze the code of this function.

Key Terms and Concepts
• Term: Elapsed time, Wait time
• Concept: Locks and Waits Analysis, CPU Usage, Data of Interest

Next Step
Analyze Code

Analyze Code
You identified the mutex that caused significant Wait time and poor processor utilization. Double-click this critical section in the Bottom-up pane to view the source. The Intel(R) VTune(TM) Amplifier XE opens source and disassembly code. Focus on the Source pane and analyze the source code:
• Understand basic options provided in the Source window.
• Identify the hottest code lines.

Understand Basic Source View Options
The table below explains some of the features available in the Source pane for the Locks and Waits viewpoint.
• Source code of the application displayed if the function symbol information is available. When you go to the source by double-clicking the synchronization object in the Bottom-up pane, the VTune Amplifier XE opens the wait function containing this object and highlights the code line that took the most Wait time. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected wait function. To view the source code in the Source pane, make sure to build the target properly.
• Processor time and utilization bar attributed to a particular code line.
The colored bar represents the distribution of the Wait time according to the utilization levels (Idle, Poor, Ok, Ideal, and Over) defined by the VTune Amplifier XE. The longer the bar, the higher the value. Ok utilization level is not available for systems with a small number of cores. This is the Data of Interest column for the Locks and Waits analysis.
• Number of times the corresponding system wait API was called while this code line was executing. For a lock, it is the number of times the lock was contended and caused a wait.
• Source window toolbar. Use hotspot navigation buttons to switch between the most performance-critical code lines. Hotspot navigation is based on the metric column selected as a Data of Interest. For the Locks and Waits analysis, this is Wait Time. Use the source file editor button to open and edit your code in your default editor.

Identify the Hottest Code Lines
The VTune Amplifier XE highlights line 165 entering the rgb_mutex mutex in the draw_task function. The draw_task function was waiting for almost 27 seconds while this code line was executing and most of the time the processor was underutilized. During this time, the critical section was contended 491 times.

The rgb_mutex is the place where the application is serializing. Each thread has to wait for the mutex to be available before it can proceed. Only one thread can be in the mutex at a time. You need to optimize the code to make it more concurrent. Click the Source Editor button on the Source window toolbar to open the code editor and optimize the code.

Recap
You identified the code section that caused a significant wait and during which the processor was poorly utilized.

Key Terms and Concepts
• Term: Wait time
• Concept: CPU Usage, Locks and Waits Analysis, Data of Interest

Next Step
Remove Lock

Remove Lock
In the Source window, you located the mutex that caused a significant wait while the processor cores were underutilized and that was contended many times.
Focus on this line and do the following:
1. Open the code editor.
2. Modify the code to remove the lock.

Open the Code Editor
Click the Source Editor button to open the analyze_locks.cpp file in your default editor at the hotspot code line.

Remove the Lock
The rgb_mutex was introduced to protect calculation from multithreaded access. The brief analysis shows that the code is thread safe and the mutex is not really needed. To resolve this issue:
1. Comment out code lines 165 and 172 to disable the mutex.
2. Save the changes made in the source file.
3. Browse to the directory where you extracted the sample code (for example, /home/intel/samples/en/tachyon_vtune_amp_xe).
4. Rebuild your target in the release mode using the make command as follows:
$ make clean
$ make release
The tachyon_analyze_locks application is rebuilt and stored in the tachyon_vtune_amp_xe directory.
5. Run tachyon_analyze_locks as follows:
$ /home/intel/samples/en/tachyon_vtune_amp_xe/tachyon_analyze_locks dat/balls.dat
The system runs the tachyon_analyze_locks application. Note that the execution time reduced from 29.647 seconds to 14.615 seconds.

Recap
You optimized the application execution time by removing the unnecessary critical section that caused a lot of Wait time.

Key Terms and Concepts
• Term: hotspot
• Concept: Locks and Waits Analysis

Next Step
Compare with Previous Result

Compare with Previous Result
You made sure that removing the mutex gave you 15 seconds of optimization in the application execution time. To understand the impact of your changes and how the CPU utilization has changed, re-run the Locks and Waits analysis on the optimized code and compare results:
• Compare results before and after optimization.
• Identify the performance gain.
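The tutorial does not reproduce the analyze_locks.cpp source, so the following is only a minimal sketch of the pattern behind the Remove Lock step, written in C with POSIX threads. Apart from the rgb_mutex name, everything in it (the pixels buffer, shade, draw_row_locked, draw_row_unlocked, render, and the sizes) is hypothetical. It assumes, as the tutorial concludes for the sample, that each thread writes only to data it owns, which is the condition under which the mutex can be dropped safely.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4  /* hypothetical worker count */
#define WIDTH 1024     /* hypothetical pixels per row */

static unsigned char pixels[NUM_THREADS][WIDTH];
static pthread_mutex_t rgb_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical per-pixel calculation: it reads only its arguments. */
unsigned char shade(size_t row, size_t x)
{
    return (unsigned char)((row * 31u + x * 7u) & 0xFFu);
}

/* Before: every pixel write takes the shared mutex, so threads serialize. */
void *draw_row_locked(void *arg)
{
    size_t row = *(const size_t *)arg;
    for (size_t x = 0; x < WIDTH; ++x) {
        pthread_mutex_lock(&rgb_mutex);
        pixels[row][x] = shade(row, x);
        pthread_mutex_unlock(&rgb_mutex);
    }
    return NULL;
}

/* After: each thread writes only its own row, so the lock is removed. */
void *draw_row_unlocked(void *arg)
{
    size_t row = *(const size_t *)arg;
    for (size_t x = 0; x < WIDTH; ++x) {
        pixels[row][x] = shade(row, x);
    }
    return NULL;
}

/* Runs one worker thread per row and returns a checksum of the pixels. */
unsigned long render(void *(*worker)(void *))
{
    pthread_t threads[NUM_THREADS];
    size_t rows[NUM_THREADS];
    unsigned long sum = 0;

    for (size_t i = 0; i < NUM_THREADS; ++i) {
        rows[i] = i;
        int rc = pthread_create(&threads[i], NULL, worker, &rows[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t i = 0; i < NUM_THREADS; ++i) {
        pthread_join(threads[i], NULL);
    }
    for (size_t r = 0; r < NUM_THREADS; ++r) {
        for (size_t x = 0; x < WIDTH; ++x) {
            sum += pixels[r][x];
        }
    }
    return sum;
}
```

Calling render(draw_row_locked) and render(draw_row_unlocked) should give the same checksum, which is a quick way to confirm that removing a lock did not change the output. If threads shared any writable state, the unlocked variant would be a data race and the mutex would have to stay.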

Compare Results Before and After Optimization
1. Run the Locks and Waits analysis on the modified code.
2. Click the Compare Results button on the Intel(R) VTune(TM) Amplifier XE toolbar. The Compare Results window opens.
3. Specify the Locks and Waits analysis results you want to compare.
The Summary window opens providing the statistics for the difference between collected results. Click the Bottom-up tab to see the list of synchronization objects used in the code, Wait time utilization across the two results, and the differences side by side:
• Difference in Wait time per utilization level between the two results in the following format: = – . By default, the Difference column is expanded to display comparison data per utilization level. You may collapse the column to see the total difference data per Wait time.
• Wait time and CPU utilization for the initial version of the code.
• Wait time and CPU utilization for the optimized version of the code.
• Difference in Wait count between the two results in the following format: = - .
• Wait count for the initial version of the code.
• Wait count for the optimized version of the code.

Identify the Performance Gain
The Elapsed time data in the Summary window shows the optimization of 4 seconds for the whole application execution and Wait time decreased by 37.5 seconds.
According to the Thread Concurrency histogram, before optimization (blue bar) the application ran serially for 9 seconds, poorly utilizing available processor cores, but after optimization (orange bar) it ran serially only for 2 seconds. After optimization the application ran 5 threads simultaneously, overutilizing the cores for almost 5 seconds. You may consider this direction as an additional area for improvement.
In the Bottom-up pane, locate the Mutex you identified as a bottleneck in your code.
Since you removed it during optimization, the optimized result r004lw does not show any performance data for this synchronization object. If you collapse the Wait Time:Difference column by clicking the button, you see that with the optimized result you got almost 27 seconds of optimization in Wait time.

Recap
You ran the Locks and Waits analysis on the optimized code and compared the results before and after optimization using the Compare mode of the VTune Amplifier XE. The comparison shows that, with the optimized version of the tachyon_analyze_locks application (r004lw result), you managed to remove the lock preventing application parallelism and significantly reduce the application execution time.
Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier XE command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier XE online help.

Key Terms and Concepts
• Term: hotspot, Wait time
• Concept: Locks and Waits Analysis, CPU Usage

Next Step
Read Summary

Summary
You have completed the Analyzing Locks and Waits tutorial. Here are some important things to remember when using the Intel(R) VTune(TM) Amplifier XE to analyze your code for locks and waits:

Step 1. Choose and Build Your Target
• Create a performance baseline to compare the application versions before and after optimization. Make sure to use the same workload for each application run.
• Create a VTune Amplifier XE project and use the Project Properties: Target tab to choose and configure your analysis target.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis.
For example, you may limit the data collection to a predefined amount of data or enable the VTune Amplifier XE to collect more accurate CPU time data. You can also run the analysis from the command line using the amplxe-cl command.

Step 3. Interpret Results and Resolve the Issue
• Start analyzing the performance of your application with the Summary pane to explore the performance metrics for the whole application. Then, move to the Bottom-up window to analyze the synchronization objects. Focus on the synchronization objects that under- or over-utilized the available logical CPUs and have the highest Wait time and Wait Count values. By default, the objects with the highest Wait time values show up at the top of the window.
• Expand the most time-critical synchronization object in the Bottom-up pane and double-click the wait function it belongs to. This opens the source code for this wait function at the code line with the highest Wait time value.

Step 4. Compare Results Before and After Optimization
• Perform regular regression testing by comparing analysis results before and after optimization. From the GUI, click the Compare Results button on the VTune Amplifier XE toolbar. From the command line, use the amplxe-cl command.
• Expand each data column by clicking the button to identify the performance gain per CPU utilization level.

Tutorial: Identifying Hardware Issues

Learning Objectives
This tutorial shows how to use the General Exploration analysis of the Intel(R) VTune(TM) Amplifier XE to identify the hardware-related issues in the sample application.
Estimated completion time: 15 minutes. Sample application: matrix.
After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Run the General Exploration analysis for Intel(R) microarchitecture code name Nehalem.
• Understand the event-based performance metrics.
• Identify the types of the most critical hardware issues for the application as a whole.
• Identify the modules/functions that caused the most critical hardware issues.
• Analyze the source code to locate the most critical code lines.
• Identify the next steps of the performance analysis to get more detailed results.

Start Here
Workflow Steps to Identify Hardware Issues

Workflow Steps to Identify Hardware Issues
You can use an advanced event-based sampling analysis of the Intel® VTune™ Amplifier XE to identify the most significant hardware issues that affect the performance of your application. This tutorial guides you through these workflow steps running the General Exploration analysis type on a sample matrix application.
1. Build an application to analyze for hardware issues and create a new VTune Amplifier XE project.
2. Choose and run the General Exploration analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical functions.
5. Modify the code to resolve the detected performance issues and rebuild the code.

Build Application and Create New Project
Before you start analyzing hardware issues affecting the performance of your application, do the following:
1. Build the application in the release mode with full optimizations.
2. Create a VTune Amplifier XE project.

Choose a Build Mode and Build a Target
1. Browse to the directory where you extracted the sample code (for example, /home/sample/matrix/linux). Make sure this directory contains Makefile.
2. Build your target in the release mode using the make command.
The matrix application is automatically built with the GNU* compiler (as matrix.gcc) and stored in the matrix/linux directory.

Create a Project
1. Set the EDITOR or VISUAL environment variable to associate your source files with the code editor (like emacs, vi, vim, gedit, and so on). For example:
$ export EDITOR=gedit
2.
From the /bin32 directory (for IA-32 architecture) or from the /bin64 directory (for Intel(R) 64 architecture), run the amplxe-gui script launching the VTune Amplifier XE GUI client. By default, the installation directory is /opt/intel/vtune_amplifier_xe_2011.
3. Create a new project via File > New > Project.... The Create a Project dialog box opens.
4. Specify the project name matrix that will be used as the project directory name and click the Create Project button. By default, the VTune Amplifier XE creates a project directory under the root/intel/amplxe/Projects directory and opens the Project Properties: Target dialog box.
5. In the Target: Application to Launch pane, browse to the matrix.gcc application and click OK.

Recap
You built the target in the Release mode and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts
• Term: target
• Concept: Event-based Sampling Analysis

Next Step
Run General Exploration Analysis

Run General Exploration Analysis
Before running an analysis, choose a configuration level to influence Intel(R) VTune(TM) Amplifier XE analysis scope and running time. In this tutorial, you run the General Exploration analysis on the Intel(R) Core(TM) i7 processor based on the Intel(R) microarchitecture code name Nehalem. The General Exploration analysis type helps identify the widest scope of hardware issues that affect the application performance. This analysis type is based on the hardware event-based sampling collection.
To run the analysis:
1. From the VTune Amplifier XE toolbar, click the New Analysis button. The New Amplifier XE Result tab opens with the Analysis Type configuration window active.
2. From the analysis tree on the left, select the Advanced Intel(R) Microarchitecture Code Name Nehalem Analysis > General Exploration analysis type.
3. Click the Start button on the right to run the analysis.
The VTune Amplifier XE launches the matrix application that calculates matrix transformations and exits. The VTune Amplifier XE finalizes the collected data and opens the results in the Hardware Issues viewpoint.
NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap
You ran the General Exploration analysis that monitors how your application performs against a set of event-based hardware metrics. To see the list of processor events used for this analysis type, see the Details section of the General Exploration configuration pane.

Key Terms and Concepts
• Term: viewpoint
• Concept: Event-based Sampling Analysis, Finalization

Next Step
Interpret Results

Interpret Results
When the application exits, the Intel(R) VTune(TM) Amplifier XE finalizes the results and opens the Hardware Issues viewpoint that consists of the Summary window, Bottom-up window, and Timeline pane. To interpret the collected data and understand where you should focus your tuning efforts for the specific hardware, do the following:
• Understand the event-based metrics
• Identify the hardware issues that affect the performance of your application
NOTE The screenshots and execution time data provided in this tutorial are created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Event-based Metrics
Click the Summary tab to explore the data provided in the Summary window for the whole application performance.
• Elapsed time is the wall time from the beginning to the end of the collection. Treat this metric as your basic performance baseline against which you will compare subsequent runs of the application. The goal of your optimization is to reduce the value of this metric.
• Event-based performance metrics.
Each metric is an event ratio provided by Intel architects. Mouse over the yellow icon to see the metric description and formula used for the metric calculation.
• Values calculated for each metric based on the event count. VTune Amplifier XE highlights those values that exceed the threshold set for the corresponding metric. Such a value highlighted in pink signifies an application-level hardware issue.
• The text below a metric with the detected hardware issue describes the issue, potential cause and recommendations on the next steps, and displays a threshold formula used for calculation. Mouse over the truncated text to read a full description.
A quick look at the summary results reveals that the matrix application has the following issues:
• CPI (Clockticks per Instructions Retired) Rate
• Retire Stalls
• LLC Miss
• LLC Load Misses Serviced by Remote DRAM
• Execution Stalls
• Data Sharing

Identify the Hardware Issues
Click the Bottom-up tab to open the Bottom-up window and see how each program unit performs against the event-based metrics. Each row represents a program unit and percentage of the CPU cycles used by this unit. Program units that take more than 5% of the CPU time are considered hotspots. This means that by resolving a hardware issue that, for example, took about 20% of the CPU cycles, you can obtain 20% optimization for the hotspot.
By default, the VTune Amplifier XE sorts data in the descending order by Clockticks and provides the hotspots at the top of the list. You see that the multiply1 function is the most obvious hotspot in the matrix application. It has the highest event count (Clockticks and Instructions Retired events) and most of the hardware issues were also detected during execution of this function.
NOTE Mouse over a column header with an event-based metric name to see the metric description. Mouse over a highlighted cell to read the description of the hardware issue detected for the program unit.
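To make the two rules above concrete, here is a minimal sketch in C of how the CPI Rate ratio and the 5% hotspot rule could be computed from raw event counts. The function names (cpi_rate, clockticks_share, is_hotspot) are hypothetical and are not part of any VTune Amplifier XE interface; the only facts taken from the tutorial are that CPI is Clockticks per Instructions Retired and that a program unit above 5% of CPU time counts as a hotspot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPI Rate: Clockticks divided by Instructions Retired. */
double cpi_rate(uint64_t clockticks, uint64_t instructions_retired)
{
    assert(instructions_retired > 0);
    return (double)clockticks / (double)instructions_retired;
}

/* Fraction of all Clockticks attributed to one program unit. */
double clockticks_share(uint64_t unit_clockticks, uint64_t total_clockticks)
{
    assert(total_clockticks > 0);
    return (double)unit_clockticks / (double)total_clockticks;
}

/* Hotspot rule from the tutorial: more than 5% of the CPU time. */
bool is_hotspot(uint64_t unit_clockticks, uint64_t total_clockticks)
{
    return clockticks_share(unit_clockticks, total_clockticks) > 0.05;
}
```

For example, a function with 300 clockticks and 100 retired instructions has a CPI Rate of 3, and a function that accounts for 60 of 1000 total clockticks passes the hotspot test while one with 40 of 1000 does not.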
For the multiply1 function, the VTune Amplifier XE highlights the same issues that were detected as the issues affecting the performance of the whole application:
• CPI Rate is high (>1). Potential causes are memory stalls, instruction starvation, branch misprediction, or long-latency instructions. To define the cause for your code, explore other metrics in the Bottom-up window.
• The Retire Stalls metric shows that during the execution of the multiply1 function, about 95% (0.945) of CPU cycles were waiting for data to arrive. This may result from branch misprediction, instruction starvation, long latency operations, and other issues. Once you have located the stalled instructions in your code, analyze metrics such as LLC Miss, Execution Stalls, Remote Accesses, Data Sharing, and Contested Accesses. You can also look for long-latency instructions like divisions and string operations to understand the cause.
• The LLC Miss metric shows that about 120% (1.220) of CPU cycles were spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce the data working set size, improve data access locality, block and consume data in chunks that fit in the LLC, or better exploit hardware prefetchers. Consider using software prefetchers but beware that they can increase latency by interfering with normal loads and can increase pressure on the memory system.
• The LLC Load Misses Serviced by Remote DRAM metric shows that 55% (0.554) of cycles were spent servicing memory requests from remote DRAM. Wherever possible, try to consistently use data on the same core, or at least the same package, as it was allocated on.
• The Execution Stalls metric shows that 36% (0.364) of cycles were spent with no micro-operations executed. Look for long-latency operations at code regions with high execution stalls and try to use alternative methods or lower latency operations.
For example, consider replacing div operations with right-shifts or try to reduce the latency of memory accesses.
• The Data Sharing metric took about 7% (0.066) of cycles. To understand the cause, examine the Contested Accesses metric to determine whether the major component of data sharing is due to contested accesses or simple read sharing. Read sharing is a lower priority than Contested Accesses or issues such as LLC Misses and Remote Accesses. If simple read sharing is a performance bottleneck, consider changing data layout across threads or rearranging computation. However, this type of tuning may not be straightforward and could bring more serious performance issues back.

Recap
You analyzed the data provided in the Hardware Issues viewpoint, explored the event-based metrics, and identified the areas where your sample application had hardware issues. You also identified the function that performs poorly against these metrics and is a good candidate for further analysis.

Key Terms and Concepts
• Term: viewpoint, baseline, Elapsed time
• Concept: Event-based Sampling Analysis, Event-based Metrics

Next Step
Analyze Code

Analyze Code
You identified a hotspot function with a number of hardware issues. Double-click the multiply1 function in the Bottom-up window to open the source code.
The table below explains some of the features available in the Source pane when viewing the event-based sampling analysis data.
• Source pane displaying the source code of the application, which is available if the function symbol information is available. The code line that took the highest number of Clockticks samples is highlighted. The source code in the Source pane is not editable.
• Values per hardware event attributed to a particular code line. By default, the data is sorted by the Clockticks event count. Focus on the events that constitute the metrics identified as performance-critical in the Bottom-up window.
To identify these events, mouse over the metric column header in the Bottom-up window. Drag-and-drop the columns to organize the view for your convenience. VTune Amplifier XE remembers your settings and restores them each time you open the viewpoint.
• Hotspot navigation buttons to switch between code lines that took a long time to execute.
• Source file editor button to open and edit your code in the default editor.
• Assembly button to toggle the Assembly pane that displays assembly instructions for the selected function.
In the Source pane for the multiply1 function, you see that line 38 took most of the Clockticks event samples during execution. But from your code knowledge, you understand that the culprit should be line 39. Due to event skid (that may happen at the low granularity level like source line, instruction, or basic block), the VTune Amplifier XE mistakenly attributed the samples collected for line 39 to line 38. This code section multiplies matrices in the loop but accesses the memory inefficiently. Focus on this section and try to reduce the memory issues.

Recap
You analyzed the code for the hotspot function identified in the Bottom-up window and located the hotspot line that generated a high number of CPU Clockticks.

Key Terms and Concepts
• Concept: Event Skid

Next Step
Resolve Issue

Resolve Issue
In the Source pane, you identified that in the multiply1 function the code line 39 resulted in the highest values for the Clockticks event. To solve this issue, do the following:
• Change the multiplication algorithm and, if using the Intel(R) compiler, enable vectorization.
• Re-run the analysis to verify optimization.

Change Algorithm
NOTE The proposed solution is one of the multiple ways to optimize the memory access and is used for demonstration purposes only.
1. Open the matrix.c file from the sample code directory (for example, /home/sample/matrix/src).
For this sample, the matrix.c file is used to initialize the functions used in the multiply.c file.
2. In line 90, replace the multiply1 function name with the multiply2 function. This new function uses the loop interchange mechanism that optimizes the memory access in the code.
The proposed optimization assumes you may use the Intel(R) C++ Compiler to build the code. The Intel compiler helps vectorize the data, which means that it uses SIMD instructions that can work with several data elements simultaneously. If only one source file is used, the Intel compiler enables vectorization automatically. The current sample uses several source files, which is why the multiply2 function uses #pragma ivdep to instruct the compiler to ignore assumed vector dependencies. This information lets the compiler enable the Supplemental Streaming SIMD Extensions 3 (SSSE3).
3. Save files and rebuild the project using the compiler of your choice. If you have the Intel(R) compiler installed, you may run it from the code sample directory (for example: /home/sample/matrix/linux) as follows:
make icc
The matrix application is automatically built with the Intel compiler (as matrix.icc) and stored in the matrix/linux directory.

Verify Optimization
1. From the VTune Amplifier XE toolbar, click the Project Properties button. The Project Properties dialog box opens with the Target tab active. The Launch Application pane is open by default.
2. In the Application field, click the Browse... button and navigate to the updated matrix application. This tutorial uses the application compiled with the Intel compiler, matrix.icc.
3. Click OK to close the dialog box.
4. From the VTune Amplifier XE toolbar, click the New Analysis button. The Analysis Type configuration window opens.
5.
From the left pane, select Advanced Intel(R) Microarchitecture Code Name Nehalem Analysis > General Exploration and click the Start button on the right. VTune Amplifier XE reruns the General Exploration analysis for the updated matrix target and creates a new result, r001ge, that opens automatically.
6. In the r001ge result, click the Summary tab to see the Elapsed time value for the optimized code.
You see that the Elapsed time has reduced from 15.730 seconds to 1.678 seconds and the VTune Amplifier XE now identifies only three types of issues for the application performance: high CPI Rate, Retire Stalls, and LLC Miss.

Recap
You solved the memory access issue for the sample application by interchanging the loops and sped up the execution time. You also considered using the Intel compiler to enable instruction vectorization.

Key Terms and Concepts
• Concept: Event-based Sampling Analysis

Next Step
Resolve Next Issue

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Resolve Next Issue
You got a significant performance boost by optimizing the memory access for the multiply1 function.
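The loop interchange behind that gain can be illustrated with a small sketch in C. The tutorial does not list the bodies of multiply1 and multiply2, so the functions below (multiply_ijk, multiply_ikj, fill_demo, max_abs_diff) and the dimension N are hypothetical stand-ins that show only the general technique: swapping the two inner loops so that the innermost loop walks memory row by row instead of striding down a column.

```c
#include <assert.h>
#include <stddef.h>

#define N 64  /* hypothetical matrix dimension */

/* Fills a matrix with small integer values so that results are exact. */
void fill_demo(double m[N][N], int seed)
{
    assert(seed >= 0);
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            m[i][j] = (double)((i + 2 * j + (size_t)seed) % 7);
}

/* Original order (i, j, k): the inner loop reads b[k][j] down a column,
 * touching a new row of b on every iteration. c must start zeroed. */
void multiply_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

/* Interchanged order (i, k, j): the inner loop reads b[k][j] and writes
 * c[i][j] along a row, so accesses are contiguous. c must start zeroed. */
void multiply_ikj(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; ++i)
        for (size_t k = 0; k < N; ++k)
            for (size_t j = 0; j < N; ++j)
                c[i][j] += a[i][k] * b[k][j];
}

/* Largest absolute element-wise difference between two matrices. */
double max_abs_diff(double x[N][N], double y[N][N])
{
    double worst = 0.0;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j) {
            double d = x[i][j] - y[i][j];
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

Both orderings compute the same product; only the memory access pattern changes. Because C stores arrays in row-major order, the interchanged version keeps the inner loop on consecutive addresses, which is the kind of access that caches and hardware prefetchers handle well and that a vectorizing compiler can turn into packed instructions.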
According to the data provided in the Summary window for your updated result, r001ge, you still have high CPI rate, LLC Miss, and Retire Stalls issues. You can try to optimize your code further following the steps below:
• Analyze results after optimization
• Use more advanced algorithms
• Verify optimization

Analyze Results after Optimization
To get more details on the issues that still affect the performance of the matrix application, switch to the Bottom-up window.
You see that the multiply2 function (in fact, updated multiply1 function) is still a hotspot. Double-click this function to view the source code and click both the Source and Assembly buttons on the toolbar to enable the Source and Assembly panes.
In the Source pane, the VTune Amplifier XE highlights line 53 that took the highest number of Clockticks samples. This is again the section where matrices are multiplied. The Assembly pane is automatically synchronized with the Source pane. It highlights the basic blocks corresponding to the code line highlighted in the Source pane. If you compiled the application with the Intel(R) Compiler, you can see that highlighted block 1 includes vectorization instructions added after your previous optimization. All vectorization instructions have the p (packed) postfix (for example, mulpd). You may use the -vec-report3 option of the Intel compiler (/Qvec-report3 on Windows* OS) to generate the compiler optimization report and see which loops were not vectorized and why. For more details, see the Intel compiler documentation.

Use More Advanced Algorithms
1. Open the matrix.c file from the Source Files of the matrix project.
2. In line 90, replace the multiply2 function name with the multiply3 function. This function enables loading the matrix data by blocks.
3. Save the files and rebuild the project.

Verify Optimization
1.
From the VTune Amplifier XE File menu, select New > Quick Intel(R) Microarchitecture Code Name Nehalem - General Exploration Analysis. VTune Amplifier XE reruns the General Exploration analysis for the updated matrix target and creates a new result, r002ge, that opens automatically.
2. In the r002ge result, click the Summary tab to see the Elapsed time value for the optimized code.
You see that the Elapsed time has reduced a little, from 1.678 seconds to 1.244 seconds, but the hardware issues identified in the previous run (CPI Rate, Retire Stalls, and LLC Miss) stayed practically the same. This means that there is more room for improvement and you can try other, more effective, mechanisms of matrix multiplication.

Recap
You tried optimizing the mechanism of matrix multiplication and obtained 0.4 seconds of optimization in the application execution time.

Key Terms and Concepts
• Concept: Event-based Sampling Analysis, Event-based Metrics

Next Step
Read Summary

Summary
You have completed the Identifying Hardware Issues tutorial. Here are some important things to remember when using the Intel(R) VTune(TM) Amplifier XE to analyze your code for hardware issues:

Step 1. Choose and Build Your Target
• Create a VTune Amplifier XE project and use the Project Properties: Target tab to choose and configure your analysis target.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis. You may choose between a predefined analysis type like the General Exploration type used in this tutorial, or create a new custom analysis type and add events of your choice. For more details on the custom collection, see the Creating a New Analysis Type topic in the product online help.

Step 3.
Interpret Results and Resolve the Issue
• Start analyzing the performance of your application from the Summary window to explore the event-based performance metrics for the whole application. Mouse over the yellow help icons to read the metric descriptions. Use the Elapsed time value as your performance baseline.
• Move to the Bottom-up window and analyze the performance per function. Focus on the hotspots, that is, the functions that took the highest Clockticks event count. By default, they are located at the top of the table. Analyze the hardware issues detected for the hotspot functions. Hardware issues are highlighted in pink. Mouse over a highlighted value to read the issue description and see the threshold formula.
• Double-click the hotspot function in the Bottom-up pane to open its source code at the code line that took the highest Clockticks event count.
• Consider using the Intel(R) Compiler to vectorize instructions. Explore the compiler documentation for more details.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

More Resources

Getting Help
Intel(R) VTune(TM) Amplifier XE provides a number of Getting Started tutorials. These tutorials use a sample application to demonstrate the basic product features and workflows. You can access these documents through the Help menu or by clicking the VTune Amplifier XE icon.
For the standalone user interface, the tutorials are available via the Help > Getting Started Tutorials menu.
To view help in the standalone user interface, select Intel VTune Amplifier XE 2011 Help from the Help menu.

Navigating in the Product Usage Workflow
Where applicable, the VTune Amplifier XE help topics provide a Where am I in the workflow? button. Click the button to view the workflow with a highlight on the stage that this topic discusses.

Using Context-Sensitive Help
Context-sensitive help enables easy access to help topics on active GUI elements. The following context-sensitive help features are available on a product-specific basis:
• F1 Help: Press F1 to get help for an active dialog box, property page, pane, or window.

Product Website and Support
The following links provide information and support on Intel software products, including Intel(R) Parallel Studio XE:
• http://software.intel.com/en-us/articles/tools/ Intel(R) Software Development Products Knowledge Base.
• http://www.intel.com/software/products/support/ Technical support information, to register your product, or to contact Intel.
For additional support information, see the Technical Support section of your Release Notes.

System Requirements
For detailed information on system requirements, see the Release Notes.
Chapter 5: Intel(R) VTune(TM) Amplifier XE Tutorials Troubleshooting

Troubleshooting

Problem: The Start button is disabled
The Start button on the command toolbar is disabled.
Solution: Make sure you specified an analysis target. If the target is not specified, click the Project Properties button on the command toolbar and enter the target name in the Application to Launch pane. For the General Exploration analysis, the Start button may be disabled if you mistakenly chose the incorrect processor type. The selected analysis type should match your processor type.

Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Windows* OS
Document Number: 323906-005US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel® VTune™ Amplifier XE
Prerequisites
Navigation Quick Start
Key Terms and Concepts

Chapter 1: Tutorial: Finding Hotspots
• Learning Objectives
• Workflow Steps to Identify and Analyze Hotspots
• Visual Studio* IDE: Choose Project and Build Application
• Standalone GUI: Build Application and Create New Project
• Run Hotspots Analysis
• Interpret Result Data
• Analyze Code
• Tune Algorithms
• Compare with Previous Result
• Summary

Chapter 2: Tutorial: Analyzing Locks and Waits
• Learning Objectives
• Workflow Steps to Identify Locks and Waits
• Visual Studio* IDE: Choose Project and Build Application
• Standalone GUI: Build Application and Create New Project
• Run Locks and Waits Analysis
• Interpret Result Data
• Analyze Code
• Remove Lock
• Compare with Previous Result
• Summary

Chapter 3: Tutorial: Identifying Hardware Issues
• Learning Objectives
• Workflow Steps to Identify Hardware Issues
• Visual Studio* IDE: Choose Project and Build Application
• Standalone GUI: Build Application and Create New Project
• Run General Exploration Analysis
• Interpret Results
• Analyze Code
• Resolve Issue
• Resolve Next Issue
• Summary

Chapter 4: More Resources
• Getting Help
• Product Website and Support

Chapter 5: Intel® VTune™ Amplifier XE Tutorials Troubleshooting
• Troubleshooting

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries. Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright (C) 2010-2011, Intel Corporation. All rights reserved.

Introducing the Intel® VTune™ Amplifier XE

The Intel® VTune™ Amplifier XE, an Intel® Parallel Studio XE tool, provides information on code performance for users developing serial and multithreaded applications on Windows* and Linux* operating systems. On Windows systems, the VTune Amplifier XE integrates into Microsoft Visual Studio* software and is also available as a standalone GUI client. On Linux systems, VTune Amplifier XE works only as a standalone GUI client.
On both Windows and Linux systems, you can benefit from using the command-line interface for collecting data remotely or for performing regression testing. VTune Amplifier XE helps you analyze the algorithm choices and identify where and how your application can benefit from available hardware resources.

Use the VTune Amplifier XE to locate or determine the following:
• The most time-consuming (hot) functions in your application and/or on the whole system
• Sections of code that do not effectively utilize available processor time
• The best sections of code to optimize for sequential performance and for threaded performance
• Synchronization objects that affect the application performance
• Whether, where, and why your application spends time on input/output operations
• The performance impact of different synchronization methods, different numbers of threads, or different algorithms
• Thread activity and transitions
• Hardware-related bottlenecks in your code

Intel VTune Amplifier XE Tutorials

These tutorials tell you how to use the VTune Amplifier XE to analyze the performance of a sample application by identifying software- and hardware-related issues in the code.
• Finding Hotspots
• Analyzing Locks and Waits
• Identifying Hardware Issues

Check http://software.intel.com/en-us/articles/intel-software-product-tutorials/ for the printable version (PDF) of product tutorials.

See Also: Getting Help

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

NOTE The instructions and screen shots in these tutorials refer to the Visual Studio* 2005 integrated development environment (IDE). They may slightly differ for other versions of Visual Studio IDE or for the standalone version of the Intel® VTune™ Amplifier XE. See online help for details.
Required Tools

You need the following tools to use these tutorials:
• Intel® VTune™ Amplifier XE
• Sample code included with the VTune Amplifier XE. VTune Amplifier XE provides the following sample applications:
  • tachyon application used for the Finding Hotspots and Analyzing Locks and Waits tutorials
  • matrix application used for the Identifying Hardware Issues tutorial
• VTune Amplifier XE Help
• Microsoft Visual Studio* 2005 or later

To acquire the VTune Amplifier XE:
If you do not already have access to the VTune Amplifier XE, you can download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/. To install the VTune Amplifier XE, follow the instructions in the Release Notes.

To install and set up VTune Amplifier XE sample code:
1. Copy the tachyon_vtune_amp_xe.zip and matrix_vtune_amp_xe.zip files from the samples \\C++ folder in the Intel VTune Amplifier XE installation directory to a writable directory or share on your system. The default installation directory is C:\Program Files\Intel\VTune Amplifier XE 2011 (on certain systems, instead of Program Files, the folder name is Program Files (x86)).
2. Extract the sample(s) from the .zip file.

NOTE
• Samples are non-deterministic. Your screens may vary from the screen shots shown throughout these tutorials.
• Samples are designed only to illustrate VTune Amplifier XE features and do not represent best practices for tuning the code. Results may vary depending on the nature of the analysis.

To run the VTune Amplifier XE:
• For Microsoft Visual Studio*: VTune Amplifier XE integrates into Visual Studio when installation completes. To configure and run an analysis, open your solution and go to Tools > Intel VTune Amplifier XE 2011 > New Analysis... from the Visual Studio menu or click the New Analysis button from the VTune Amplifier XE toolbar. See the Navigation Quick Start for more details.
• For the standalone interface: From the Start menu, select All Programs > Intel Parallel Studio XE 2011 > Intel VTune Amplifier XE 2011.

To access VTune Amplifier XE Help:
See the Getting Help topic.

Required Skills and Knowledge

These tutorials are designed for developers with the following skills and knowledge:
• Basic understanding of the Microsoft Visual Studio* 2005 development environment (IDE), including how to:
  • Open a project/solution.
  • Display the Solution Explorer and Output windows.
  • Compile and link a project.
  • Ensure a project compiled successfully.
  • Access the Document Explorer window.

Navigation Quick Start

Intel® VTune™ Amplifier XE/Microsoft Visual Studio* 2005 Integration

NOTE This topic describes integration into Microsoft Visual Studio* 2005. Integration into other versions of the Visual Studio IDE or the standalone VTune Amplifier XE interface may slightly differ.

The VTune Amplifier XE integrates into the Visual Studio* development environment (IDE) and can be accessed from the menus, toolbar, and Solution Explorer in the following manner:
• Use the VTune Amplifier XE toolbar to configure and control result collection.
• VTune Amplifier XE results (*.amplxe) show up in the Solution Explorer under the My Amplifier XE Results folder. To configure and control result collection, right-click the project in the Solution Explorer and select the Intel VTune Amplifier XE 2011 menu from the pop-up menu. To manage previously collected results, right-click the result (for example, r002hs.amplxe) and select the required command from the pop-up menu.
• Use the drop-down menu to select a viewpoint, a preset configuration of windows/panes for an analysis result. For each analysis type, you can switch among several preset configurations to focus on particular performance metrics.
• Click the buttons on navigation toolbars to change window views and toggle window panes on and off.
• In the Timeline pane, analyze the thread activity and transitions presented for the user-mode sampling and tracing analysis results (for example, Hotspots, Concurrency, Locks and Waits) or analyze the distribution of the application performance per metric over time for the event-based sampling analysis results (for example, Memory Access, Bandwidth Breakdown).
• Use the Call Stack pane to view call paths for a function selected in the grid.
• Use the filter toolbar to filter out the result data according to the selected categories.
• In Microsoft Visual Studio* 2005/2008, use the Dynamic Help window to access help topics related to the current VTune Amplifier XE window/pane.

Key Terms and Concepts

Key Terms

baseline: A performance metric used as a basis for comparison of the application versions before and after optimization. Baseline should be measurable and reproducible.

CPU time: The amount of time a thread spends executing on a logical processor. For multiple threads, the CPU time of the threads is summed. The application CPU time is the sum of the CPU time of all the threads that run the application.

Elapsed time: The total time your target ran, calculated as follows: Wall clock time at end of application – Wall clock time at start of application.

hotspot: A section of code that took a long time to execute. Some hotspots may indicate bottlenecks and can be removed, while other hotspots inevitably take a long time to execute due to their nature.

target: A target is an executable file you analyze using the Intel® VTune™ Amplifier XE.

viewpoint: A preset result tab configuration that filters out the data collected during a performance analysis and enables you to focus on specific performance problems. When you select a viewpoint, you select a set of performance metrics the VTune Amplifier XE shows in the windows/panes of the result tab. To select the required viewpoint, click the button and use the drop-down menu at the top of the result tab.

Wait time: The amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.

Key Concept: CPU Usage

For the user-mode sampling and tracing analysis types, the Intel® VTune™ Amplifier XE identifies a processor utilization scale, calculates the target CPU usage, and defines default utilization ranges depending on the number of processor cores. You can change the utilization ranges by dragging the slider in the CPU Usage histogram in the Summary window.

Utilization Type | Default color | Description
Idle | | All CPUs are waiting - no threads are running.
Poor | | Poor usage. By default, poor usage is when the number of simultaneously running CPUs is less than or equal to 50% of the target CPU usage.
OK | | Acceptable (OK) usage. By default, OK usage is when the number of simultaneously running CPUs is between 51-85% of the target CPU usage.
Ideal | | Ideal usage. By default, Ideal usage is when the number of simultaneously running CPUs is between 86-100% of the target CPU usage.

Key Concept: Data of Interest

The VTune Amplifier XE maintains a special column called Data of Interest. This column is highlighted with a yellow background and a yellow star in the column header. The data in the Data of Interest column is used by various windows as follows:
• The Call Stack pane calculates the contribution, shown in the contribution bar, using the Data of Interest column values.
• The Filter bar uses the data of interest values to calculate the percentage indicated in the filtered option.
• The Source/Assembly window uses this column for hotspot navigation.

If a viewpoint has more than one column with numeric data or bars, you can change the default Data of Interest column by right-clicking the required column and selecting the Set Column as Data of Interest command from the pop-up menu.
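
The default utilization ranges listed under Key Concept: CPU Usage can be illustrated with a short sketch. The Python function below is not part of the VTune Amplifier XE: the name classify_cpu_usage and its parameters are hypothetical, and the code assumes that the range boundaries fall exactly at 50% and 85% of the target CPU usage, that zero running CPUs means Idle, and that anything above 85% counts as Ideal.

```python
def classify_cpu_usage(running_cpus: float, target_cpu_usage: float) -> str:
    """Map the number of simultaneously running CPUs to a default utilization type.

    Hypothetical helper for illustration only. The boundaries (50% and 85%)
    follow the default ranges described in the text; how values between
    50-51% and 85-86% are treated is an assumption.
    """
    if target_cpu_usage <= 0:
        raise ValueError("target_cpu_usage must be positive")
    if running_cpus <= 0:
        return "Idle"  # all CPUs are waiting, no threads are running
    percent = 100.0 * running_cpus / target_cpu_usage
    if percent <= 50:
        return "Poor"
    if percent <= 85:
        return "OK"
    return "Ideal"


if __name__ == "__main__":
    # Example: a system whose target CPU usage is 4 (assumed value).
    for cpus in range(5):
        print(cpus, classify_cpu_usage(cpus, 4))
```

Dragging the slider in the CPU Usage histogram corresponds to changing the two boundary values, which are hard-coded here to keep the sketch short.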
Key Concept: Event-based Metrics

When analyzing data collected during a hardware event-based sampling analysis, the VTune Amplifier XE uses performance metrics. Each metric is an event ratio with its own threshold values. As soon as the performance of a program unit per metric exceeds the threshold, the VTune Amplifier XE marks this value as a performance issue (in pink) and provides recommendations on how to fix it. Each column in the Bottom-up pane provides data per metric. To read the metric description and see the formula used for the metric calculation, mouse over the metric column header. To read the description of the hardware issue and see the threshold formula used for this issue, mouse over the link cell in the grid.

For the full list of metrics used by the VTune Amplifier XE, see the Hardware Event-based Metrics topic in the online help.

Key Concept: Event-based Sampling Analysis

VTune Amplifier XE introduces a set of advanced hardware analysis types based on the event-based sampling data collection and targeted for the Intel® Core™ 2 processor family, processors based on the Intel® microarchitecture code name Nehalem, and processors based on the Intel® microarchitecture code name Sandy Bridge. Depending on the analysis type, the VTune Amplifier XE monitors a set of hardware events and, as a result, provides collected data per so-called hardware event-based metrics defined by Intel architects (for example, Clockticks per Instructions Retired, Contested Accesses, and so on). Typically, it is recommended to start with the General Exploration analysis type, which collects the maximum number of events and provides the widest picture of the hardware issues that affected the performance of your application.

For more information on the event-based sampling analysis, see the Hardware Event-based Sampling Collection topic in the online help.

Key Concept: Event Skid

Event skid is the recording of an event not exactly on the code line that caused the event. Event skids may even result in a caller function event being recorded in the callee function. Event skid is caused by a number of factors:
• The delay in propagating the event out of the processor's microcode through the interrupt controller (APIC) and back into the processor.
• The current instruction retirement cycle must be completed.
• When the interrupt is received, the processor must serialize its instruction stream, which causes a flushing of the execution pipeline.

The Intel(R) processors support accurate event location for some events. These events are called precise events. See the online help for more details.

Key Concept: Finalization

Finalization is the process of the Intel® VTune™ Amplifier XE converting the collected data to a database, resolving symbol information, and pre-computing data to make further analysis more efficient and responsive. The VTune Amplifier XE finalizes data automatically when data collection completes. You may want to re-finalize a result to:
• update symbol information after changes in the search directories settings
• resolve [Unknown] entries in the results

Key Concept: Hotspots Analysis

The Hotspots analysis helps understand the application flow and identify sections of code that took a long time to execute (hotspots). A large number of samples collected at a specific process, thread, or module can imply high processor utilization and potential performance bottlenecks. Some hotspots can be removed, while other hotspots are fundamental to the application functionality and cannot be removed.

The Intel® VTune™ Amplifier XE creates a list of functions in your application ordered by the amount of time spent in a function. It also detects the call stacks for each of these functions so you can see how the hot functions are called.
The VTune Amplifier XE uses a low overhead (about 5%) user-mode sampling and tracing collection that gets you the information you need without slowing down the application execution significantly.

Key Concept: Locks and Waits Analysis

While the Concurrency analysis helps identify where your application is not parallel, the Locks and Waits analysis helps identify the cause of the ineffective processor utilization. One of the most common problems is threads waiting too long on synchronization objects (locks). Performance suffers when waits occur while cores are under-utilized. During the Locks and Waits analysis you can estimate the impact each synchronization object introduces to the application and understand how long the application was required to wait on each synchronization object, or in blocking APIs, such as sleep and blocking I/O.

Key Concept: Thread Concurrency

The number of active threads corresponds to the concurrency level of an application. By comparing the concurrency level with the number of processors, Intel® VTune™ Amplifier XE classifies how an application utilizes the processors in the system. It defines default utilization ranges depending on the number of processor cores and displays the thread concurrency in the Summary and Bottom-up windows. You can change the utilization ranges by dragging the slider in the Summary window.

Thread concurrency may be higher than CPU Usage if threads are in the runnable state and not consuming CPU time. VTune Amplifier XE defines the Target Concurrency level for your application that is, by default, equal to the number of physical cores.

Utilization Type | Default color | Description
Idle | | All threads in the application are waiting - no threads are running. There can be only one bar in the Thread Concurrency histogram indicating Idle utilization.
Poor | | Poor utilization. By default, poor utilization is when the number of threads is up to 50% of the target concurrency.
OK | | Acceptable (OK) utilization. By default, OK utilization is when the number of threads is between 51-85% of the target concurrency.
Ideal | | Ideal utilization. By default, ideal utilization is when the number of threads is between 86-115% of the target concurrency.
Over | | Over-utilization. By default, over-utilization is when the number of threads is more than 115% of the target concurrency.

Chapter 1: Tutorial: Finding Hotspots

Learning Objectives

This tutorial shows how to use the Hotspots analysis of the Intel® VTune™ Amplifier XE to understand where the sample application is spending time, identify hotspots (the most time-consuming program units), and detect how they were called. Some hotspots may indicate bottlenecks that can be removed, while other hotspots are inevitable and take a long time to execute due to their nature. Typically, the hotspot functions identified during the Hotspots analysis use the most time-consuming algorithms and are good candidates for parallelization.

The Hotspots analysis is useful to analyze the performance of both serial and parallel applications.

Estimated completion time: 15 minutes.
Sample application: tachyon.

After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Choose the Hotspots analysis type.
• Run the Hotspots analysis to locate most time-consuming functions in an application.
• Analyze the function call flow and threads.
• Analyze the source code to locate the most time-critical code lines.
• Compare results before and after optimization.

Start Here: Workflow Steps to Identify and Analyze Hotspots

Workflow Steps to Identify and Analyze Hotspots

You can use the Intel® VTune™ Amplifier XE to identify and analyze hotspot functions in your serial or parallel application by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon.
1. Do one of the following:
   • Visual Studio* IDE: Choose a project, verify settings, and build application
   • Standalone GUI: Build an application to analyze for hotspots and create a new VTune Amplifier XE project
2. Choose and run the Hotspots analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical function.
5. Modify the code to tune the algorithms or rebuild the code with Intel® Compiler.
6. Re-build the target, re-run the Hotspots analysis, and compare the result data before and after optimization.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Visual Studio* IDE: Choose Project and Build Application

Before you start analyzing your application target for hotspots, do the following:
1. Choose a project with the analysis target in the Visual Studio IDE.
2. Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that VTune Amplifier XE can properly identify system functions and classify and attribute functions.
3. Configure Visual Studio project properties to generate the debug information for your application so that VTune Amplifier XE can open the source code.
4. Build the target in the release mode with full optimizations, which is recommended for performance analysis.
5. Run the application without debugging to create a performance baseline.

For this tutorial, your target is a ray-tracer application, tachyon. To learn how to install and set up the sample code, see Prerequisites.
• The steps below are provided for Microsoft Visual Studio 2005. They may slightly differ for other versions of Visual Studio.
• Steps provided by this tutorial are generic and applicable to any application. You may choose to follow the proposed workflow using your own application.

Choose a Project

1. From the Visual Studio menu, select File > Open > Project/Solution.... The Open Project dialog box opens.
2. In the Open Project dialog box, browse to the location you used to extract the tachyon_vtune_amp_xe.zip file and select the tachyon_vtune_amp_xe.sln file. The solution is added to Visual Studio IDE and shows up in the Solution Explorer.
3. In the Solution Explorer, right-click the find_hotspots project and select Project > Set as StartUp Project. find_hotspots appears in bold in the Solution Explorer.

When you choose a project in Visual Studio IDE, the VTune Amplifier XE automatically creates the config.amplxeproj project file and sets the find_hotspots application as an analysis target in the project properties.

Enable Downloading the Debug Information for System Libraries

1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files

1. Select the find_hotspots project and go to Project > Properties.
2. From the find_hotspots Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the find_hotspots Property Pages dialog box, select C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the find_hotspots Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target

1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build find_hotspots. The tachyon_find_hotspots.exe application is built.

NOTE The build configuration for tachyon may initially be set to Debug, which is typically used for development. When analyzing performance issues with the VTune Amplifier XE, you are recommended to use the Release build with normal optimizations. In this way, the VTune Amplifier XE is able to analyze the realistic performance of your application.

Create a Performance Baseline

1. From the Visual Studio menu, select Debug > Start Without Debugging. The tachyon_find_hotspots.exe application starts running.

   NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.

2. Note the execution time displayed in the window caption.
For the tachyon_find_hotspots.exe executable in the figure above, the execution time is 63.609 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.

NOTE Run the application several times, note the execution time for each run, and use the average number. This helps to minimize skewed results due to transient system activity.

Recap

You chose the target for the Hotspots analysis, set up your environment to enable generating symbol information for system libraries and your binary files, built the target in the Release mode, and created the performance baseline. Your application is ready for analysis.

Key Terms and Concepts
• Term: target
• Concept: Hotspots Analysis

Next Step
Run Hotspots Analysis

Standalone GUI: Build Application and Create New Project

Before you start analyzing your application target for hotspots, do the following:

1. Build the application. If you build the code in Visual Studio*, make sure to:
   • Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that VTune Amplifier XE can properly identify system functions and classify and attribute functions.
   • Configure Visual Studio project properties to generate the debug information for your application so that VTune Amplifier XE can open the source code.
   • Build the target in the release mode with full optimizations, which is recommended for performance analysis.
2. Run the application without debugging to create a performance baseline.
3. Create a VTune Amplifier XE project.

NOTE The steps below are provided for Microsoft Visual Studio 2005. They may differ slightly for other versions of Visual Studio.

Enable Downloading the Debug Information for System Libraries

1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files

1. Select the find_hotspots project and go to Project > Properties.
2. From the find_hotspots Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the find_hotspots Property Pages dialog box, select the C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the find_hotspots Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target

1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build find_hotspots. The tachyon_find_hotspots.exe application is built.

NOTE The build configuration for tachyon may initially be set to Debug, which is typically used for development. When analyzing performance issues with the VTune Amplifier XE, it is recommended that you use the Release build with normal optimizations. In this way, the VTune Amplifier XE is able to analyze the realistic performance of your application.

Create a Performance Baseline

1. From the Visual Studio menu, select Debug > Start Without Debugging.
The tachyon_find_hotspots.exe application starts running.

NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.

2. Note the execution time displayed in the window caption. For the tachyon_find_hotspots.exe executable in the figure above, the execution time is 63.609 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.

NOTE Run the application several times, note the execution time for each run, and use the average number. This helps to minimize skewed results due to transient system activity.

Create a Project

1. From the Start menu, select Intel Parallel Studio XE 2011 > Intel VTune Amplifier XE 2011 to launch the VTune Amplifier XE standalone GUI.
2. Create a new project via File > New > Project.... The Create a Project dialog box opens.
3. Specify the project name tachyon that will be used as the project directory name. The VTune Amplifier XE creates the tachyon project directory under the %USERPROFILE%\My Documents\My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
4. In the Application to Launch pane of the Target tab, specify and configure your target as follows:
   • For the Application field, browse to: \find_hotspots.exe, for example: C:\samples\tachyon_vtune_amp_xe\vc8\find_hotspots_Win32_Release\find_hotspots.exe.
5. Click OK to apply the settings and exit the Project Properties dialog box.

Recap

You set up your environment to enable generating symbol information for system libraries and your binary files, built the target in the Release mode, created the performance baseline, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.
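
The advice to run the application several times and use the average can be sketched in a few lines. This is only an illustration: the function name baseline_seconds is hypothetical, and of the timings below only 63.609 seconds comes from the tutorial; the other two values are made up to show the arithmetic.

```python
from statistics import mean


def baseline_seconds(run_times):
    """Average several execution times (in seconds) into one baseline value."""
    if not run_times:
        raise ValueError("need at least one run time")
    return mean(run_times)


# 63.609 s is the run reported in the tutorial; the other two are invented.
runs = [63.609, 64.102, 63.215]
print(f"baseline: {baseline_seconds(runs):.3f} s")
```

Averaging a handful of runs keeps a single run disturbed by background activity from distorting the number you later compare against.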
Key Terms and Concepts
• Term: target, baseline

Next Step
Run Hotspots Analysis

Run Hotspots Analysis

In this tutorial, you run the Hotspots analysis to identify the hotspots, that is, the functions that took the most time to execute.

To run an analysis:

1. From the VTune Amplifier XE toolbar, click the New Analysis button. The VTune Amplifier XE result tab opens with the Analysis Type window active.
2. On the left pane of the Analysis Type window, locate the analysis tree and select Algorithm Analysis > Hotspots. The right pane is updated with the default options for the Hotspots analysis.
3. Click the Start button on the right command bar. VTune Amplifier XE launches the tachyon_find_hotspots application, which renders balls.dat as an input file, calculates the execution time, and exits. VTune Amplifier XE finalizes the collected results and opens the Hotspots viewpoint.

NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap

You launched the Hotspots data collection that analyzes function calls and CPU time spent in each program unit of your application.

NOTE This tutorial explains how to run an analysis from the VTune Amplifier XE graphical user interface (GUI). You can also use the VTune Amplifier XE command-line interface (amplxe-cl command) to run an analysis. For more details, check the Command-line Interface Support section of the VTune Amplifier XE Help.

Key Terms and Concepts
• Term: hotspot, Elapsed time, viewpoint
• Concept: Hotspots Analysis, Finalization

Next Step
Interpret Result Data

Interpret Result Data

When the sample application exits, the Intel® VTune™ Amplifier XE finalizes the results and opens the Hotspots viewpoint that consists of the Summary, Bottom-up, and Top-down Tree windows.
To interpret the data on the sample code performance, do the following:

• Understand the basic performance metrics provided by the Hotspots analysis.
• Analyze the most time-consuming functions.
• Analyze CPU usage per function.

NOTE The screenshots and execution time data provided in this tutorial are created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Basic Hotspots Metrics

Start analysis with the Summary window. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.

Note that CPU Time for the sample application is equal to 64.907 seconds. It is the sum of CPU time for all application threads. Total Thread Count is 3, so the sample application is multi-threaded.

The Top Hotspots section provides data on the most time-consuming functions (hotspot functions) sorted by CPU time spent on their execution. For the sample application, the initialize_2D_buffer function, which took 27.671 seconds to execute, shows up at the top of the list as the hottest function. The [Others] entry at the bottom shows the sum of CPU time for all functions not listed in the table.

Analyze the Most Time-consuming Functions

Click the Bottom-up tab to explore the Bottom-up pane. By default, the data in the grid is sorted by Function. You may change the grouping level using the Grouping drop-down menu at the top of the grid.

Analyze the CPU Time column values. This column is marked with a yellow star as the Data of Interest column. It means that the VTune Amplifier XE uses this type of data for some calculations (for example, filtering, stack contribution, and others). Functions that took most CPU time to execute are listed on top. The initialize_2D_buffer function took 27.671 seconds to execute.
Click the plus sign at the initialize_2D_buffer function to expand the stacks calling this function. You see that it was called only by the setup_2D_buffer function.

Select the initialize_2D_buffer function in the grid and explore the data provided in the Call Stack pane on the right. The Call Stack pane displays full stack data for each hotspot function and enables you to navigate between function call stacks and understand the impact of each stack on the function CPU time. The stack functions in the Call Stack pane are represented in the following format: ! - :, where the line number corresponds to the line calling the next function in the stack. For the sample application, the hottest function initialize_2D_buffer is called at line 86 of the setup_2D_buffer function in the global.cpp file.

Analyze CPU Usage per Function

VTune Amplifier XE enables you to analyze the collected data from different perspectives by using multiple viewpoints. For the Hotspots analysis result, you may switch to the Hotspots by CPU Usage viewpoint to understand how your hotspot function performs in terms of the CPU usage. Explore this viewpoint to determine how your application utilized available cores and identify the most serial code.

If you go back to the Summary window, you can see the CPU Usage Histogram that represents the Elapsed time and usage level for the available logical processors. Ideally, the highest bar of your chart should match the Target level.

The tachyon_find_hotspots application ran mostly on one logical CPU. If you hover over the highest bar, you see that it spent 62.491 seconds using one core only, which is classified by the VTune Amplifier XE as Poor utilization for a dual-core system. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane. To get the detailed CPU usage information per function, use the button in the Bottom-up window to expand the CPU Time column.
Note that initialize_2D_buffer is the function with the longest poor CPU utilization (red bars). This means that the processor cores were underutilized most of the time spent on executing this function.

If you change the grouping level (highlighted in the figure above) in the Bottom-up pane from Function/Call Stack to Thread/Function/Call Stack, you see that the initialize_2D_buffer function belongs to the thread_video thread. This thread is also identified as a hotspot and shows up at the top in the Bottom-up pane. To get detailed information on the hotspot thread performance, explore the Timeline pane.

• Timeline area. When you hover over the graph element, the timeline tooltip displays the time passed since the application has been launched.
• Threads area that shows the distribution of CPU time utilization per thread. Hover over a bar to see the CPU time utilization in percent for this thread at each moment of time. Green zones show the time threads are active.
• CPU Usage area that shows the distribution of CPU time utilization for the whole application. Hover over a bar to see the application-level CPU time utilization in percent at each moment of time. VTune Amplifier XE calculates the overall CPU Usage metric as the sum of CPU time per each thread of the Threads area. Maximum CPU Usage value is equal to [number of processor cores] x 100%.

The Timeline analysis also identifies the thread_video thread as the most active. The tooltip shows that CPU time values rarely exceed 100% whereas the maximum CPU time value for dual-core systems is 200%. This means that the processor cores were half-utilized for most of the time spent on executing the tachyon_find_hotspots application.

Recap

You identified a function that took the most CPU time and could be a good candidate for algorithm tuning.
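
The CPU Usage arithmetic described above (maximum value equal to the number of processor cores times 100%) can be checked with a small sketch. The function names are hypothetical and are not part of VTune Amplifier XE; the inputs are the dual-core figures quoted in the text.

```python
def max_cpu_usage_percent(cores):
    """Maximum CPU Usage value: [number of processor cores] x 100%."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * 100


def core_utilization(observed_percent, cores):
    """Fraction of the available CPU capacity that was actually used."""
    return observed_percent / max_cpu_usage_percent(cores)


# CPU time values that rarely exceed 100% on a dual-core system (maximum 200%).
print(max_cpu_usage_percent(2))   # 200
print(core_utilization(100, 2))   # 0.5, that is, half-utilized
```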
Key Terms and Concepts
• Term: Elapsed time, CPU time, viewpoint
• Concept: Hotspots Analysis, CPU Usage, Data of Interest

Next Step
Analyze Code

Analyze Code

You identified initialize_2D_buffer as the hottest function. In the Bottom-up pane, double-click this function to open the Source window and analyze the source code:

• Understand basic options provided in the Source window.
• Identify the hottest code lines.

Understand Basic Source Window Options

The list below explains some of the features available in the Source window when viewing the Hotspots analysis data.

• Source pane displaying the source code of the application if the function symbol information is available. The code line that took the most CPU time to execute is highlighted. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected hotspot function. To enable the Source pane, make sure to build the target properly.
• Assembly pane displaying the assembler instructions for the selected hotspot function. Assembler instructions are grouped by basic blocks. The assembler instructions for the selected hotspot function are highlighted. To get help on an assembler instruction, right-click the instruction and select Instruction Reference.
  NOTE To get the help on a particular instruction, make sure to have the Adobe* Acrobat Reader* 9 (or later) installed. If an earlier version of the Adobe Acrobat Reader is installed, the Instruction Reference opens but you need to locate the help on each instruction manually.
• Processor time attributed to a particular code line. If the hotspot is a system function, its time, by default, is attributed to the user function that called this system function.
• Source window toolbar. Use the hotspot navigation buttons to switch between the most performance-critical code lines.
Hotspot navigation is based on the metric column selected as a Data of Interest. For the Hotspots analysis, this is CPU Time. Use the Source/Assembly buttons to toggle the Source/Assembly panes (if both of them are available) on/off.
• Heat map markers to quickly identify performance-critical code lines (hotspots). The bright blue markers indicate hot lines for the function you selected for analysis. Light blue markers indicate hot lines for other functions. Scroll to a marker to locate the hot code line it identifies.

Identify the Hottest Code Lines

When you identify a hotspot in the serial code, you can make some changes in the code to tune the algorithms and speed up that hotspot. Another option is to parallelize the sample code by adding threads to the application so that it performs well on multi-core processors. This tutorial focuses on algorithm tuning.

By default, when you double-click the hotspot in the Bottom-up pane, VTune Amplifier XE opens the source file related to this function, highlighting the code line that took the most CPU time. For the initialize_2D_buffer function, the hottest code line is 84. This code is used to initialize a memory array using non-sequential memory locations.

Click the Source Editor button on the Source window toolbar to open the default code editor and work on optimizing the code.

Recap

You identified the code section that took the most CPU time to execute.

Key Terms and Concepts
• Term: hotspot, CPU time
• Concept: Hotspots Analysis, Data of Interest

Next Step
Tune Algorithms

Tune Algorithms

In the Source window, you identified that in the initialize_2D_buffer hotspot function the code line 84 took the most CPU time. Focus on this line and do the following:

1. Open the code editor.
2. Resolve the performance problem using any of these options:
   • Optimize the algorithm used in this code section.
   • Recompile the code with the Intel® Compiler.
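
To make "non-sequential memory locations" concrete, the sketch below lists the order in which two nested loops touch a 2D buffer stored row by row in one flat array. It is not the tachyon source code: the function name index_order, the row-major layout, and the 4 x 3 buffer size are assumptions chosen for illustration. The sketch shows access order only; it does not measure speed.

```python
def index_order(width, height, column_first):
    """Return the flat row-major indices in the order the loops visit them."""
    order = []
    if column_first:
        # Outer loop over columns: consecutive accesses are `width` elements apart.
        for x in range(width):
            for y in range(height):
                order.append(y * width + x)
    else:
        # Outer loop over rows: consecutive accesses are adjacent in memory.
        for y in range(height):
            for x in range(width):
                order.append(y * width + x)
    return order


print(index_order(4, 3, column_first=True))   # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
print(index_order(4, 3, column_first=False))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Both loop orders visit every element exactly once, so swapping the loops changes only the memory access pattern. On real hardware the adjacent pattern typically makes better use of the cache.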
Open the Code Editor

In the Source window, click the Source Editor button to open the find_hotspots.cpp file in the default code editor at the hotspot line.

Hotspot line 84 is used to initialize a memory array using non-sequential memory locations. For demonstration purposes, the code lines are commented as a slower method of filling the array.

Resolve the Problem

To resolve this issue, use one of the following methods:

Option 1: Optimize your algorithm

1. Edit line 79 to comment out code lines 82-88 marked as a "First (slower) method".
2. Edit line 95 to uncomment code lines 98-104 marked as a "Faster method". In this step, you interchange the for loops to initialize the array in sequential memory locations.
3. From the Visual Studio menu, select Build > Rebuild find_hotspots. The project is rebuilt.
4. From the Visual Studio Debug menu, select Start Without Debugging to run the application. Visual Studio runs the tachyon_find_hotspots.exe. Note that the execution time has dropped from 63.609 seconds to 57.282 seconds.

Option 2: Recompile the code with Intel® Compiler

This option assumes that you have Intel® Composer XE installed. Composer XE is part of Intel® Parallel Studio XE. By default, the Intel® Compiler, one of the Composer components, uses powerful optimization switches, which typically provide some gain in performance. For more details on the Intel compiler, see the Intel Composer documentation.

As an alternative, you may consider running the default Microsoft Visual Studio compiler applying more aggressive optimization switches.

To recompile the code with the Intel compiler:

1. From the Visual Studio Project menu, select Intel Composer XE > Use Intel C++....
2. In the Confirmation window, click OK to confirm your choice. The project in Solution Explorer appears with the ComposerXE icon.
3. From the Visual Studio menu, select Build > Rebuild find_hotspots. The project is rebuilt with the Intel compiler.
4. From the Visual Studio menu, select Debug > Start Without Debugging. Visual Studio runs the tachyon_find_hotspots.exe. Note that the execution time is reduced.

Recap

You interchanged the loops in the hotspot function, rebuilt the application, and got a performance gain of 6 seconds. You also considered an alternative optimization technique using the Intel C++ compiler.

Key Terms and Concepts
• Term: hotspot

Next Step
Compare with Previous Result

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Compare with Previous Result

You optimized your code to apply a loop interchange mechanism that gave you 6 seconds of improvement in the application execution time. To understand whether you got rid of the hotspot and what kind of optimization you got per function, re-run the Hotspots analysis on the optimized code and compare results:

• Compare results before and after optimization.
• Identify the performance gain.

Compare Results Before and After Optimization

1. Run the Hotspots analysis on the modified code.
2. Click the Compare Results button on the Intel® VTune™ Amplifier XE toolbar. The Compare Results window opens.
3. Specify the Hotspots analysis results you want to compare and click the Compare Results button.

The Hotspots Bottom-up window opens, showing the CPU time usage across the two results and the differences side by side.

• Difference in CPU time between the two results in the following format: = – .
• CPU time for the initial version of the tachyon_find_hotspots.exe application.
• CPU time for the optimized version of the tachyon_find_hotspots.exe.

Identify the Performance Gain

Explore the Bottom-up pane to compare CPU time data for the first hotspot: CPU Time:r000hs - CPU Time:r001hs = CPU Time: Difference. 27.671s - 21.321s = 6.350s, which means that you got the optimization of ~6 seconds for the initialize_2D_buffer function.

If you switch to the Summary window, you see that the Elapsed time also shows 3.6 seconds of optimization for the whole application execution.

Recap

You ran the Hotspots analysis on the optimized code and compared the results before and after optimization using the Compare mode of the VTune Amplifier XE. Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier XE command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier XE online help.

Key Terms and Concepts
• Term: hotspot, CPU time
• Concept: Hotspots Analysis

Next Step
Read Summary

Summary

You have completed the Finding Hotspots tutorial. Here are some important things to remember when using the Intel® VTune™ Amplifier XE to analyze your code for hotspots:

Step 1. Choose and Build Your Target

• Configure the Microsoft* symbol server and your project properties to get the most accurate results for system and user binaries and to analyze the performance of your application at the code line level.
• Create a performance baseline to compare the application versions before and after optimization. Make sure to use the same workload for each application run.
• Use the Project Properties: Target tab to choose and configure your analysis target. For Visual Studio* projects, the analysis target settings are inherited automatically.

Step 2. Run Analysis

• Use the Analysis Type configuration window to choose, configure, and run the analysis. For example, you may limit the data collection to a predefined amount of data or enable the VTune Amplifier XE to collect more accurate CPU time data. You can also run the analysis from the command line using the amplxe-cl command.

Step 3. Interpret Results and Resolve the Issue

• Start analyzing the performance of your application from the Summary window to explore the performance metrics for the whole application. Then, move to the Bottom-up window to analyze the performance per function. Focus on the hotspots - functions that took the most CPU time. By default, they are located at the top of the table.
• Double-click the hotspot function in the Bottom-up pane or Call Stack pane to open its source code at the code line that took the most CPU time.
• Consider using Intel® Compiler, part of the Intel® Composer XE, to optimize your tuning algorithms. Explore the compiler documentation for more details.

Step 4. Compare Results Before and After Optimization

• Perform regular regression testing by comparing analysis results before and after optimization. From the GUI, click the Compare Results button on the VTune Amplifier XE toolbar. From the command line, use the amplxe-cl command.
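
The before/after comparison amounts to a per-function subtraction of CPU times. The sketch below reproduces that arithmetic with the initialize_2D_buffer figures from this tutorial. The helper names and the dictionary-based input are hypothetical; this is not how VTune Amplifier XE stores or exports results.

```python
def cpu_time_differences(before, after):
    """Per-function CPU time difference in seconds (before minus after)."""
    names = sorted(set(before) | set(after))
    return {name: before.get(name, 0.0) - after.get(name, 0.0) for name in names}


def regressions(differences):
    """Names of functions whose CPU time grew after the change."""
    return [name for name, diff in differences.items() if diff < 0]


# CPU time of the first hotspot in the initial and optimized results.
before = {"initialize_2D_buffer": 27.671}
after = {"initialize_2D_buffer": 21.321}

diffs = cpu_time_differences(before, after)
print(f"{diffs['initialize_2D_buffer']:.3f} s")  # 6.350 s
print(regressions(diffs))                        # []
```

A positive difference is a gain; a function reported by regressions became slower and deserves a closer look.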
Tutorial: Analyzing Locks and Waits

Learning Objectives

This tutorial shows how to use the Locks and Waits analysis of the Intel® VTune™ Amplifier XE to identify one of the most common reasons for an inefficient parallel application - threads waiting too long on synchronization objects (locks) while processor cores are underutilized. Focus your tuning efforts on objects with long waits where the system is underutilized.

Estimated completion time: 15 minutes.
Sample application: tachyon.

After you complete this tutorial, you should be able to:

• Choose an analysis target.
• Choose the Locks and Waits analysis type.
• Run the Locks and Waits analysis.
• Identify the synchronization objects with long waits and poor CPU utilization.
• Analyze the source code to locate the most critical code lines.
• Compare results before and after optimization.
Start Here
Workflow Steps to Identify Locks and Waits

Workflow Steps to Identify Locks and Waits

You can use the Intel® VTune™ Amplifier XE to understand the cause of the ineffective processor utilization by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon.

1. Do one of the following:
   • Visual Studio* IDE: Choose a project, verify settings, and build application.
   • Standalone GUI: Build an application to analyze for locks and waits and create a new VTune Amplifier XE project.
2. Run the Locks and Waits analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical function.
5. Modify the code to remove the lock.
6. Re-build the target, re-run the Locks and Waits analysis, and compare the result data before and after optimization.

Visual Studio* IDE: Choose Project and Build Application

Before you start analyzing your application for locks, do the following:

1. Choose a project with the analysis target in the Visual Studio IDE.
2. Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that VTune Amplifier XE can properly identify system functions and classify and attribute functions.
3. Configure Visual Studio project properties to generate the debug information for your application so that VTune Amplifier XE can open the source code.
4. Build the target in the release mode with full optimizations, which is recommended for performance analysis.
5. Run the application without debugging to create a performance baseline.

For this tutorial, your target is a ray-tracer application, tachyon. To learn how to install and set up the sample code, see Prerequisites.

• The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE may differ slightly. See online help for details.
• The steps provided by this tutorial are generic and applicable to any application. You may choose to follow the proposed workflow using your own application.

Choose a Project

1. From the Visual Studio menu, select File > Open > Project/Solution.... The Open Project dialog box opens.
2. In the Open Project dialog box, browse to the location you used to unzip the tachyon_vtune_amp_xe.zip file and select the tachyon_vtune_amp_xe.sln file. The solution is added to Visual Studio and shows up in the Solution Explorer.
3. In Solution Explorer, right-click the analyze_locks project and select Project > Set as StartUp Project. analyze_locks appears in bold in Solution Explorer.

Enable Downloading the Debug Information for System Libraries

1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft* Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files

1. Select the analyze_locks project and go to Project > Properties.
2. From the analyze_locks Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the analyze_locks Property Pages dialog box, select the C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the analyze_locks Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target

1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build analyze_locks. The tachyon_analyze_locks application is built.

NOTE The build configuration for tachyon may initially be set to Debug, which is typically used for development. When analyzing performance issues with the VTune Amplifier XE, it is recommended that you use the Release build with normal optimizations. In this way, the VTune Amplifier XE is able to analyze the realistic performance of your application.

Create a Performance Baseline

1. From the Visual Studio menu, select Debug > Start Without Debugging. The tachyon_analyze_locks application runs in multiple sections (depending on the number of CPUs in your system).

NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.

2. Note the execution time displayed in the window caption. For the tachyon_analyze_locks executable in the figure above, the execution time is 33.578 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.

NOTE Run the application several times, note the execution time for each run, and use the average number. This helps to minimize skewed results due to transient system activity.

Recap

You selected the analyze_locks project as the target for the Locks and Waits analysis.
Key Terms and Concepts
• Term: target

Next Step
Run Locks and Waits Analysis

Standalone GUI: Build Application and Create New Project
Before you start analyzing your application for locks and waits, do the following:
1. Build application. If you build the code in Visual Studio*, make sure to:
• Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that VTune Amplifier XE can properly identify system functions and classify and attribute functions.
• Configure Visual Studio project properties to generate the debug information for your application so that VTune Amplifier XE can open the source code.
• Build the target in the release mode with full optimizations, which is recommended for performance analysis.
2. Run the application without debugging to create a performance baseline.
3. Create a VTune Amplifier XE project.
NOTE The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE may differ slightly.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft* Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files
1. Select the analyze_locks project and go to Project > Properties.
2.
From the analyze_locks Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the analyze_locks Property Pages dialog box, select C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the analyze_locks Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build analyze_locks. The tachyon_analyze_locks application is built.
NOTE The build configuration for tachyon may initially be set to Debug, which is typically used for development. When analyzing performance issues with the VTune Amplifier XE, use the Release build with normal optimizations so that the VTune Amplifier XE can analyze the realistic performance of your application.

Create a Performance Baseline
1. From the Visual Studio menu, select Debug > Start Without Debugging. The tachyon_analyze_locks application runs in multiple sections (depending on the number of CPUs in your system).
NOTE Before you start the application, minimize the amount of other software running on your computer to get more accurate results.
2. Note the execution time displayed in the window caption. For the tachyon_analyze_locks executable in the figure above, the execution time is 33.578 seconds. The total execution time is the baseline against which you will compare subsequent runs of the application.
NOTE Run the application several times, note the execution time for each run, and use the average number.
This helps to minimize skewed results due to transient system activity.

Create a Project
1. From the Start menu select Intel Parallel Studio XE 2011 > Intel VTune Amplifier XE 2011 to launch the VTune Amplifier XE standalone GUI.
2. Create a new project via File > New > Project.... The Create a Project dialog box opens.
3. Specify the project name tachyon that will be used as the project directory name. VTune Amplifier XE creates a project directory under the %USERPROFILE%\My Documents\My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
4. In the Application to Launch pane of the Target tab, specify and configure your target as follows:
• For the Application field, browse to: \analyze_locks.exe.
5. Click OK to apply the settings and exit the Project Properties dialog box.

Recap
You set up your environment to enable generating symbol information for system libraries and your binary files, built the target in the Release mode, created the performance baseline, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts
• Term: target, baseline
• Concept: Locks and Waits Analysis

Next Step
Run Locks and Waits Analysis

Run Locks and Waits Analysis
Before running an analysis, choose a configuration level to define the Intel® VTune™ Amplifier XE analysis scope and running time. In this tutorial, you run the Locks and Waits analysis to identify synchronization objects that caused contention and fix the problem in the source.
To run an analysis:
1. From the VTune Amplifier XE toolbar, click the New Analysis button. The VTune Amplifier XE result tab opens with the Analysis Type window active.
2. From the analysis tree on the left, select Algorithm Analysis > Locks and Waits.
The right pane is updated with the default options for the Locks and Waits analysis.
3. Click the Start button on the right command bar. The VTune Amplifier XE launches the tachyon_analyze_locks executable that renders balls.dat as an input file, calculates the execution time, and exits. The VTune Amplifier XE finalizes the collected data and opens the results in the Locks and Waits viewpoint.
NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap
You ran the Locks and Waits data collection that analyzes how long the application had to wait on each synchronization object, or on blocking APIs, such as sleep() and blocking I/O, and estimates processor utilization during the wait.
NOTE This tutorial explains how to run an analysis from the VTune Amplifier XE graphical user interface (GUI). You can also use the VTune Amplifier XE command-line interface (amplxe-cl command) to run an analysis. For more details, check the Command-line Interface Support section of the VTune Amplifier XE Help.

Key Terms and Concepts
• Term: viewpoint
• Concept: Locks and Waits Analysis, Finalization

Next Step
Interpret Result Data

Interpret Result Data
When the sample application exits, the Intel® VTune™ Amplifier XE finalizes the results and opens the Locks and Waits viewpoint that consists of the Summary window, Bottom-up pane, Top-down Tree pane, Call Stack pane, and Timeline pane. To interpret the data on the sample code performance, do the following:
• Analyze the basic performance metrics provided by the Locks and Waits analysis.
• Identify locks.
NOTE The screenshots and execution time data provided in this tutorial are created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Analyze the Basic Locks and Waits Metrics
Start with exploring the data provided in the Summary window for the whole application performance. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
The Result Summary section provides data on the overall application performance per the following metrics:
1) Elapsed Time is the total time for each core when it was either waiting or not utilized by the application;
2) Total Thread Count is the number of threads in the application;
3) Wait Time is the amount of time the application threads waited for some event to occur, such as synchronization waits and I/O waits;
4) Wait Count is the overall number of times the system wait API was called for the analyzed application;
5) CPU Time is the sum of CPU time for all threads;
6) Spin Time is the time a thread is active in a synchronization construct.
For the tachyon_analyze_locks application, the Wait time is high. To identify the cause, you need to understand how this Wait time was distributed per synchronization objects.
The Top Waiting Objects section provides the list of five synchronization objects with the highest Wait Time and Wait Count, sorted by the Wait Time metric. For the tachyon_analyze_locks application, focus on the first three objects and explore the Bottom-up pane data for more details.
The Thread Concurrency Histogram represents the Elapsed time and concurrency level for the specified number of running threads. Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range. Note the Target value. By default, this number is equal to the number of physical cores. Consider this number as your optimization goal. The Average metric is calculated as CPU time / Elapsed time. Use this number as a baseline for your performance measurements.
The closer this number is to the number of cores, the better. For the sample code, the chart shows that tachyon_analyze_locks is a multithreaded application running four threads on a machine with four cores, but it is not using the available cores effectively. The Average CPU Usage on the chart is about 0.7, while your target should be to bring it as close to 4 as possible (for the system with four cores). Hover over the second bar to understand how long the application ran serially. The tooltip shows that the application ran one thread for almost 15 seconds, which is classified as Poor concurrency.
The CPU Usage Histogram represents the Elapsed time and usage level for the logical CPUs. Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range.
The tachyon_analyze_locks application ran mostly on one logical CPU. If you hover over the second bar, you see that it spent 16.603 seconds using one core only, which is classified by the VTune Amplifier XE as Poor utilization. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.

Identify Locks
Click the Bottom-up tab to open the Bottom-up pane.
Synchronization objects that control threads in the application. The hash (unique number) appended to some names of the objects identifies the stack creating this synchronization object. For Intel® Threading Building Blocks (Intel® TBB), VTune Amplifier XE is able to recognize all types of Intel TBB objects. To display an overhead introduced by Intel TBB library internals, the VTune Amplifier XE creates a pseudo synchronization object TBB scheduler that includes all waits from the Intel TBB runtime libraries.
The utilization of the processor time when a given thread waited for some event to occur. By default, the synchronization objects are sorted by Poor processor utilization type.
Bars showing OK or Ideal utilization (orange and green) are utilizing the processors well. You should focus your optimization efforts on functions with the longest poor CPU utilization (red bars if the bar format is selected). Next, search for the longest over-utilized time (blue bars). This is the Data of Interest column for the Locks and Waits analysis results that is used for different types of calculations, for example: call stack contribution, percentage value on the filter toolbar.
Number of times the corresponding system wait API was called. For a lock, it is the number of times the lock was contended and caused a wait. Focus your tuning efforts on the waits with both high Wait Time and Wait Count values, especially if they have poor utilization.
Wait time, during which the CPU is busy. This often occurs when a synchronization API causes the CPU to poll while the software thread is waiting. Some Spin time may be preferable to the alternative of increased thread context switches. However, too much Spin time can reflect lost opportunity for productive work.
For the analyzed sample code, you see that the top three synchronization objects caused the longest Wait time. The red bars in the Wait Time column indicate that most of the time for these objects processor cores were underutilized.
From your knowledge of the code, you may understand that the Manual and Auto Reset Event objects are most likely related to the join where the main program is waiting for the worker threads to finish. This should not be a problem. The third item in the Bottom-up pane is more interesting. It is a Critical Section that shows much serial time and is causing a wait. Click the plus sign at the object name to expand the node and see the draw_task wait function that contains this critical section and call stack.
Double-click the Critical Section to see the source code for the wait function.

Recap
You identified a synchronization object with the high Wait Time and Wait Count values and poor CPU utilization that could be a lock affecting application parallelism. Your next step is to analyze the code of this function.

Key Terms and Concepts
• Term: Elapsed time, Wait time
• Concept: Locks and Waits Analysis, CPU Usage, Data of Interest

Next Step
Analyze Code

Analyze Code
You identified the critical section that caused significant Wait time and poor processor utilization. Double-click this critical section in the Bottom-up pane to view the source. The Intel® VTune™ Amplifier XE opens source and disassembly code. Focus on the Source pane and analyze the source code:
• Understand basic options provided in the Source window.
• Identify the hottest code lines.

Understand Basic Source View Options
The table below explains some of the features available in the Source pane for the Locks and Waits viewpoint.
Source code of the application displayed if the function symbol information is available. When you go to the source by double-clicking the synchronization object in the Bottom-up pane, the VTune Amplifier XE opens the wait function containing this object and highlights the code line that took the most Wait time. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected wait function. To view the source code in the Source pane, make sure to build the target properly.
Processor time and utilization bar attributed to a particular code line. The colored bar represents the distribution of the Wait time according to the utilization levels (Idle, Poor, Ok, Ideal, and Over) defined by the VTune Amplifier XE. The longer the bar, the higher the value. Ok utilization level is not available for systems with a small number of cores.
This is the Data of Interest column for the Locks and Waits analysis.
Number of times the corresponding system wait API was called while this code line was executing. For a lock, it is the number of times the lock was contended and caused a wait.
Source window toolbar. Use hotspot navigation buttons to switch between the most performance-critical code lines. Hotspot navigation is based on the metric column selected as a Data of Interest. For the Locks and Waits analysis, this is Wait Time. Use the source file editor button to open and edit your code in your default editor.

Identify the Hottest Code Lines
The VTune Amplifier XE highlights line 170 entering the critical section rgb_critical_section in the draw_task function. The draw_task function was waiting for almost 27 seconds while this code line was executing and most of the time the processor was underutilized. During this time, the critical section was contended 438 times.
The rgb_critical_section is the place where the application is serializing. Each thread has to wait for the critical section to be available before it can proceed. Only one thread can be in the critical section at a time. You need to optimize the code to make it more concurrent.
Click the Source Editor button on the Source window toolbar to open the code editor and optimize the code.

Recap
You identified the code section that caused a significant wait and during which the processor was poorly utilized.

Key Terms and Concepts
• Term: Wait time
• Concept: CPU Usage, Locks and Waits Analysis, Data of Interest

Next Step
Remove Lock

Remove Lock
In the Source window, you located the critical section that caused a significant wait while the processor cores were underutilized and generated a high Wait Count. Focus on this line and do the following:
1. Open the code editor.
2. Modify the code to remove the lock.
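To see why a contended critical section serializes an application, consider the following minimal Python sketch. It is not the tachyon code: the names run_workers, WORK_SECONDS, and THREADS are hypothetical, and time.sleep stands in for the per-thread work. With the lock held around the work, only one thread proceeds at a time, so the elapsed time grows with the number of threads. Without the lock, the simulated work overlaps.

```python
import threading
import time

WORK_SECONDS = 0.05  # simulated per-thread work (hypothetical value)
THREADS = 4          # number of worker threads (hypothetical value)


def run_workers(use_lock):
    """Start THREADS workers and return the elapsed wall-clock time in seconds."""
    lock = threading.Lock()

    def worker():
        if use_lock:
            # Only one thread at a time can hold the lock, so the
            # simulated work of all threads runs back to back.
            with lock:
                time.sleep(WORK_SECONDS)
        else:
            # No shared lock: the simulated work overlaps across threads.
            time.sleep(WORK_SECONDS)

    threads = [threading.Thread(target=worker) for _ in range(THREADS)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"with lock:    {run_workers(True):.3f} s")
    print(f"without lock: {run_workers(False):.3f} s")
```

Removing a lock is only safe when the protected code does not need it, which is what the brief analysis in this tutorial concludes for rgb_critical_section. In your own code, confirm thread safety before removing synchronization.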

Open the Code Editor
Click the Source Editor button to open the analyze_locks.cpp file in your default editor at the hotspot code line:

Remove the Lock
The rgb_critical_section was introduced to protect calculation from multithreaded access. The brief analysis shows that the code is thread safe and the critical section is not really needed. To resolve this issue:
NOTE The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE or for the standalone version of the VTune Amplifier XE may differ slightly.
1. Comment out code lines 170 and 178 to disable the critical section.
2. From Solution Explorer, select the analyze_locks project.
3. From the Visual Studio menu, select Build > Rebuild analyze_locks. The project is rebuilt.
4. From the Visual Studio menu, select Debug > Start Without Debugging to run the application. Visual Studio runs the tachyon_analyze_locks.exe. Note that the execution time is reduced from 33.578 seconds to 20.328 seconds.

Recap
You optimized the application execution time by removing the unnecessary critical section that caused a lot of Wait time.

Key Terms and Concepts
• Term: hotspot
• Concept: Locks and Waits Analysis

Next Step
Compare with Previous Result

Compare with Previous Result
You made sure that removing the critical section gave you 13 seconds of optimization in the application execution time. To understand the impact of your changes and how the CPU utilization has changed, re-run the Locks and Waits analysis on the optimized code and compare results:
• Compare results before and after optimization.
• Identify the performance gain.

Compare Results Before and After Optimization
1. Run the Locks and Waits analysis on the modified code.
2. Click the Compare Results button on the Intel® VTune™ Amplifier XE toolbar. The Compare Results window opens.
3.
Specify the Locks and Waits analysis results you want to compare:
The Summary window opens providing the statistics for the difference between collected results. Click the Bottom-up tab to see the list of synchronization objects used in the code, Wait time utilization across the two results, and the differences side by side:
Difference in Wait time per utilization level between the two results in the following format: = – . By default, the Difference column is expanded to display comparison data per utilization level. You may collapse the column to see the total difference data per Wait time.
Wait time and CPU utilization for the initial version of the code.
Wait time and CPU utilization for the optimized version of the code.
Difference in Wait count between the two results in the following format: = - .
Wait count for the initial version of the code.
Wait count for the optimized version of the code.

Identify the Performance Gain
The Elapsed time data in the Summary window shows the optimization of 4 seconds for the whole application execution and Wait time decreased by 37.5 seconds.
According to the Thread Concurrency histogram, before optimization (blue bar) the application ran serially for 9 seconds poorly utilizing available processor cores but after optimization (orange bar) it ran serially only for 2 seconds. After optimization the application ran 5 threads simultaneously overutilizing the cores for almost 5 seconds. Further, you may consider this direction as an additional area for improvement.
In the Bottom-up pane, locate the Critical Section you identified as a bottleneck in your code. Since you removed it during optimization, the optimized result r001lw does not show any performance data for this synchronization object. If you collapse the Wait Time:Difference column by clicking the button, you see that with the optimized result you got almost 29 seconds of optimization in Wait time.
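The Difference columns subtract one result from the other. As a rough sketch of the same arithmetic, the Python snippet below applies it to the two execution times noted in this tutorial (33.578 seconds before and 20.328 seconds after removing the lock). The function name improvement is hypothetical, and the snippet assumes the difference is taken as the baseline value minus the optimized value. It does not read VTune Amplifier XE result files.

```python
def improvement(before_s, after_s):
    """Return (seconds saved, percent of the baseline saved) between two timings."""
    saved = before_s - after_s
    percent = 100.0 * saved / before_s
    return saved, percent


if __name__ == "__main__":
    # Execution times reported in this tutorial before and after removing the lock.
    saved, percent = improvement(33.578, 20.328)
    print(f"saved {saved:.3f} s ({percent:.1f}% of the baseline)")
```

A positive result means the optimized run is faster. A negative result after a later code change would flag a regression.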

Recap
You ran the Locks and Waits analysis on the optimized code and compared the results before and after optimization using the Compare mode of the VTune Amplifier XE. The comparison shows that, with the optimized version of the tachyon_analyze_locks application (r001lw result), you managed to remove the lock preventing application parallelism and significantly reduce the application execution time.
Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier XE command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier XE online help.

Key Terms and Concepts
• Term: hotspot, Wait time
• Concept: Locks and Waits Analysis, CPU Usage

Next Step
Read Summary

Summary
You have completed the Analyzing Locks and Waits tutorial. Here are some important things to remember when using the Intel® VTune™ Amplifier XE to analyze your code for locks and waits:

Step 1. Choose and Build Your Target
• Configure the Microsoft* symbol server and your project properties to get the most accurate results for system and user binaries and to analyze the performance of your application at the code line level.
• Create a performance baseline to compare the application versions before and after optimization. Make sure to use the same workload for each application run.
• Use the Project Properties: Target tab to choose and configure your analysis target. For Visual Studio* projects, the analysis target settings are inherited automatically.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis.
For example, you may limit the data collection to a predefined amount of data or enable the VTune Amplifier XE to collect more accurate CPU time data. You can also run the analysis from the command line using the amplxe-cl command.

Step 3. Interpret Results and Resolve the Issue
• Start analyzing the performance of your application with the Summary pane to explore the performance metrics for the whole application. Then, move to the Bottom-up window to analyze the synchronization objects. Focus on the synchronization objects that under- or over-utilized the available logical CPUs and have the highest Wait time and Wait Count values. By default, the objects with the highest Wait time values show up at the top of the window.
• Expand the most time-critical synchronization object in the Bottom-up pane and double-click the wait function it belongs to. This opens the source code for this wait function at the code line with the highest Wait time value.

Step 4. Compare Results Before and After Optimization
• Perform regular regression testing by comparing analysis results before and after optimization. From the GUI, click the Compare Results button on the VTune Amplifier XE toolbar. From the command line, use the amplxe-cl command.
• Expand each data column by clicking the button to identify the performance gain per CPU utilization level.

Tutorial: Identifying Hardware Issues

Learning Objectives
This tutorial shows how to use the General Exploration analysis of the Intel® VTune™ Amplifier XE to identify the hardware-related issues in the sample application.
Estimated completion time: 15 minutes.
Sample application: matrix.
After you complete this tutorial, you should be able to:
• Choose an analysis target.
• Run the General Exploration analysis for Intel® microarchitecture code name Nehalem.
• Understand the event-based performance metrics.
• Identify the types of the most critical hardware issues for the application as a whole.
• Identify the modules/functions that caused the most critical hardware issues.
• Analyze the source code to locate the most critical code lines.
• Identify the next steps of the performance analysis to get more detailed results.

Start Here
Workflow Steps to Identify Hardware Issues

Workflow Steps to Identify Hardware Issues
You can use an advanced event-based sampling analysis of the Intel® VTune™ Amplifier XE to identify the most significant hardware issues that affect the performance of your application. This tutorial guides you through these workflow steps running the General Exploration analysis type on a sample matrix application.
1. Do one of the following:
• Visual Studio* IDE: Choose a project, verify settings, and build application.
• Standalone GUI: Build an application to analyze for hardware issues and create a new VTune Amplifier XE project.
2. Choose and run the General Exploration analysis.
3. Interpret the result data.
4. View and analyze code of the performance-critical functions.
5. Modify the code to resolve the detected performance issues and rebuild the code.

Visual Studio* IDE: Choose Project and Build Application
Before you start analyzing hardware issues affecting the performance of your application, do the following:
1. Choose a project with the analysis target in the Visual Studio IDE.
2. Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that the VTune Amplifier XE can properly identify system functions and classify and attribute functions.
3. Configure Visual Studio project properties to generate the debug information for your application so that the VTune Amplifier XE can open the source code.
4. Build the target in the release mode with full optimizations, which is recommended for performance analysis.
For this tutorial, your target is the matrix application that calculates matrix transformations. To learn how to install and set up the sample code, see Prerequisites.
• The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE or for the standalone version of the Intel® VTune™ Amplifier XE 2011 may differ slightly.
• Steps provided by this tutorial are generic and applicable to any application. You may choose to follow the proposed workflow using your own application.

Choose a Project
1. From the Visual Studio menu, select File > Open > Project/Solution.... The Open Project dialog box opens.
2. In the Open Project dialog box, browse to the location where you extracted the matrix_vtune_amp_xe.zip file and select the matrix.sln file. The solution is added to Visual Studio and shows up in the Solution Explorer.
VTune Amplifier XE automatically inherits Visual Studio settings and uses the currently opened project as a target project for performance analysis. When you choose a project in Visual Studio IDE, the VTune Amplifier XE automatically creates the config.amplxeproj project file and sets the matrix application as an analysis target in the project properties.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files
1. Select the matrix project and go to Project > Properties.
2. From the matrix Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the matrix Property Pages dialog box, select C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the matrix Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build matrix. The matrix.exe application is built.

Recap
You selected the matrix project as the target for the hardware event-based sampling analysis, set up your environment to enable generating symbol information for system libraries and your binary files, and built the target in the Release mode. Your application is ready for analysis.

Next Step
Run General Exploration Analysis

Standalone GUI: Build Application and Create New Project
Before you start analyzing hardware issues affecting the performance of your application, do the following:
1. Build application. If you build the code in Visual Studio*, make sure to:
• Configure the Microsoft Visual Studio* environment to download the debug information for system libraries so that the VTune Amplifier XE can properly identify system functions and classify and attribute functions.
• Configure Visual Studio project properties to generate the debug information for your application so that the VTune Amplifier XE can open the source code.
• Build the target in the release mode with full optimizations, which is recommended for performance analysis.
2. Create a VTune Amplifier XE project.
NOTE The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of Visual Studio IDE or for the standalone version of the Intel® VTune™ Amplifier XE may differ slightly.

Enable Downloading the Debug Information for System Libraries
1. Go to Tools > Options.... The Options dialog box opens.
2. From the left pane, select Debugging > Symbols.
3. In the Symbol file (.pdb) locations field, click the button and specify the following address: http://msdl.microsoft.com/download/symbols.
4. Make sure the added address is checked.
5. In the Cache symbols from symbol servers to this directory field, specify a directory where the downloaded symbol files will be stored.
6. For Microsoft Visual Studio* 2005, check the Load symbols using the updated settings when this dialog is closed box.
7. Click OK.

Enable Generating Debug Information for Your Binary Files
1. Select the matrix project and go to Project > Properties.
2. From the matrix Property Pages dialog box, select Configuration Properties > General and make sure the selected Configuration (top of the dialog) is Active(Release).
3. From the matrix Property Pages dialog box, select C/C++ > General pane and specify the Debug Information Format as Program Database (/Zi).
4. From the matrix Property Pages dialog box, select Linker > Debugging and set the Generate Debug Info option to Yes (/DEBUG).

Choose a Build Mode and Build a Target
1. Go to the Build > Configuration Manager... dialog box and select the Release mode for your target project.
2. From the Visual Studio menu, select Build > Build matrix. The matrix.exe application is built.

Create a Project
1.
From the Start menu, select Intel Parallel Studio XE 2011 > Intel VTune Amplifier XE 2011 to launch the VTune Amplifier XE GUI client.
2. Create a new project via File > New > Project.... The Create a Project dialog box opens.
3. Specify the project name matrix that will be used as the project directory name and click the Create Project button. By default, the VTune Amplifier XE creates a project directory under the %USERPROFILE%\My Documents\My Amplifier XE Projects directory and opens the Project Properties: Target dialog box.
4. In the Target: Application to Launch pane, browse to the matrix.exe application and click OK.

Recap

You set up your environment to enable generating symbol information for system libraries and your binary files, built the target in the Release mode, and created the VTune Amplifier XE project for your analysis target. Your application is ready for analysis.

Key Terms and Concepts

• Term: target
• Concept: Event-based Sampling Analysis

Next Step

Run General Exploration Analysis

Run General Exploration Analysis

After building the target, you can run it with the Intel® VTune™ Amplifier XE to analyze its performance. In this tutorial, you run the General Exploration analysis on the Intel® Core™ i7 processor based on the Intel® microarchitecture code name Nehalem. The General Exploration analysis type helps identify the widest scope of hardware issues that affect the application performance. This analysis type is based on the hardware event-based sampling collection.

NOTE The steps below are provided for Microsoft Visual Studio* 2005. Steps for other versions of the Visual Studio IDE or for the standalone version of the VTune Amplifier XE may differ slightly.

To run the analysis:

1. From the VTune Amplifier XE toolbar, click the New Analysis button. The New Amplifier XE Result tab opens with the Analysis Type configuration window active.
2.
From the analysis tree on the left, select the Advanced Intel(R) Microarchitecture Code Name Nehalem Analysis > General Exploration analysis type.
3. Click the Start button on the right to run the analysis.

The VTune Amplifier XE launches the matrix application, which calculates matrix transformations and exits. The VTune Amplifier XE then finalizes the collected data and opens the results in the Hardware Issues viewpoint.

NOTE To make sure the performance of the application is repeatable, go through the entire tuning process on the same system with a minimal amount of other software executing.

Recap

You ran the General Exploration analysis, which monitors how your application performs against a set of event-based hardware metrics. To see the list of processor events used for this analysis type, see the Details section of the General Exploration configuration pane.

Key Terms and Concepts

• Term: viewpoint
• Concept: Event-based Sampling Analysis, Finalization

Next Step

Interpret Results

Interpret Results

When the application exits, the Intel® VTune™ Amplifier XE finalizes the results and opens the Hardware Issues viewpoint, which consists of the Summary window, Bottom-up window, and Timeline pane. To interpret the collected data and understand where you should focus your tuning efforts for the specific hardware, do the following:
• Understand the event-based metrics
• Identify the hardware issues that affect the performance of your application

NOTE The screenshots and execution time data provided in this tutorial were created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Event-based Metrics

Click the Summary tab to explore the data provided in the Summary window for the whole application performance.
• Elapsed time is the wall time from the beginning to the end of the collection. Treat this metric as your basic performance baseline against which you will compare subsequent runs of the application. The goal of your optimization is to reduce the value of this metric.
• Event-based performance metrics. Each metric is an event ratio provided by Intel architects. Mouse over the yellow icon to see the metric description and the formula used for the metric calculation.
• Values calculated for each metric based on the event count. The VTune Amplifier XE highlights those values that exceed the threshold set for the corresponding metric. Such a value, highlighted in pink, signifies an application-level hardware issue. The text below a metric with a detected hardware issue describes the issue, its potential cause, and recommendations on the next steps, and displays the threshold formula used for the calculation. Mouse over the truncated text to read the full description.

A quick look at the summary results reveals that the matrix application has the following issues:
• CPI (Clockticks per Instructions Retired) Rate
• Retire Stalls
• LLC Miss
• LLC Load Misses Serviced by Remote DRAM
• Execution Stalls
• Data Sharing

Identify the Hardware Issues

Click the Bottom-up tab to open the Bottom-up window and see how each program unit performs against the event-based metrics. Each row represents a program unit and the percentage of the CPU cycles used by this unit. Program units that take more than 5% of the CPU time are considered hotspots. This means that by resolving a hardware issue that, for example, took about 20% of the CPU cycles, you can obtain about a 20% optimization for the hotspot. By default, the VTune Amplifier XE sorts data in descending order by Clockticks and lists the hotspots at the top.
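The hotspot rule described above can be sketched in a few lines of C. This is only an illustration of the arithmetic, not a description of how the VTune Amplifier XE builds its views; the unit_sample type, the function names, and the choice to pass the 5% threshold as a parameter are assumptions made for this example.

```c
#include <stddef.h>
#include <stdlib.h>

/* One row of a Bottom-up style table: a program unit and its Clockticks count.
   The type and field names are hypothetical. */
typedef struct {
    const char *name;
    unsigned long long clockticks;
} unit_sample;

/* qsort comparator: descending order by Clockticks. */
static int by_clockticks_desc(const void *a, const void *b)
{
    const unit_sample *ua = a;
    const unit_sample *ub = b;
    if (ua->clockticks < ub->clockticks) return 1;
    if (ua->clockticks > ub->clockticks) return -1;
    return 0;
}

/* Fraction of all Clockticks attributed to units[i]; 0.0 if the total is zero. */
double clocktick_share(const unit_sample *units, size_t n, size_t i)
{
    unsigned long long total = 0;
    for (size_t k = 0; k < n; ++k)
        total += units[k].clockticks;
    return total ? (double)units[i].clockticks / (double)total : 0.0;
}

/* Sorts the units in descending order by Clockticks and returns how many of
   them exceed the given share threshold (0.05 for the 5% rule). */
size_t rank_hotspots(unit_sample *units, size_t n, double threshold)
{
    size_t hot = 0;
    qsort(units, n, sizeof units[0], by_clockticks_desc);
    for (size_t i = 0; i < n; ++i)
        if (clocktick_share(units, n, i) > threshold)
            ++hot;
    return hot;
}
```

For example, with made-up counts of 900, 70, and 30 Clockticks for three units, rank_hotspots(units, 3, 0.05) moves the 900-count unit to the top and returns 2, because shares of 90% and 7% exceed the 5% threshold while 3% does not.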
You see that the multiply1 function is the most obvious hotspot in the matrix application. It has the highest event count (Clockticks and Instructions Retired events), and most of the hardware issues were also detected during execution of this function.

NOTE Mouse over a column header with an event-based metric name to see the metric description. Mouse over a highlighted cell to read the description of the hardware issue detected for the program unit.

For the multiply1 function, the VTune Amplifier XE highlights the same issues (except for the Data Sharing issue) that were detected as the issues affecting the performance of the whole application:
• CPI Rate is high (>1). Potential causes are memory stalls, instruction starvation, branch misprediction, or long-latency instructions. To determine the cause for your code, explore other metrics in the Bottom-up window.
• The Retire Stalls metric shows that during the execution of the multiply1 function, about 90% (0.902) of CPU cycles were waiting for data to arrive. This may result from branch misprediction, instruction starvation, long-latency operations, and other issues. Once you have located the stalled instructions in your code, analyze metrics such as LLC Miss, Execution Stalls, Remote Accesses, Data Sharing, and Contested Accesses. You can also look for long-latency instructions like divisions and string operations to understand the cause.
• The LLC Miss metric shows that about 60% (0.592) of CPU cycles were spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce the data working set size, improve data access locality, block the data and consume it in chunks that fit in the LLC, or better exploit hardware prefetchers. Consider using software prefetchers, but beware that they can increase latency by interfering with normal loads and can increase pressure on the memory system.
• The LLC Load Misses Serviced by Remote DRAM metric shows that 34% (0.340) of cycles were spent servicing memory requests from remote DRAM. Wherever possible, try to consistently use data on the same core, or at least the same package, on which it was allocated.
• The Execution Stalls metric shows that 54% (0.543) of cycles were spent with no micro-operations executed. Look for long-latency operations at code regions with high execution stalls and try to use alternative methods or lower-latency operations. For example, consider replacing div operations with right-shifts, or try to reduce the latency of memory accesses.

Recap

You analyzed the data provided in the Hardware Issues viewpoint, explored the event-based metrics, and identified the areas where your sample application had hardware issues. You also identified the exact function that performs poorly against the metrics and is a good candidate for further analysis.

Key Terms and Concepts

• Term: viewpoint, baseline, Elapsed time
• Concept: Event-based Sampling Analysis, Event-based Metrics

Next Step

Analyze Code

Analyze Code

You identified a hotspot function with a number of hardware issues. Double-click the multiply1 function in the Bottom-up window to open the source code:

The table below explains some of the features available in the Source pane when viewing the event-based sampling analysis data.
• Source pane displaying the source code of the application, which is available if the function symbol information is available. The code line that took the highest number of Clockticks samples is highlighted. The source code in the Source pane is not editable.
• Values per hardware event attributed to a particular code line. By default, the data is sorted by the Clockticks event count. Focus on the events that constitute the metrics identified as performance-critical in the Bottom-up window.
To identify these events, mouse over the metric column header in the Bottom-up window. Drag and drop the columns to organize the view for your convenience. The VTune Amplifier XE remembers your settings and restores them each time you open the viewpoint.
• Hotspot navigation buttons to switch between code lines that took a long time to execute.
• Source file editor button to open and edit your code in the default editor.
• Assembly button to toggle the Assembly pane, which displays assembly instructions for the selected function.

In the Source pane for the multiply1 function, you see that line 39 took most of the Clockticks event samples during execution. This code section multiplies matrices in the loop but accesses the memory inefficiently. Focus on this section and try to reduce the memory issues.

Recap

You analyzed the code for the hotspot function identified in the Bottom-up window and located the hotspot line that generated a high number of CPU Clockticks.

Key Terms and Concepts

• Concept: Event Skid

Next Step

Resolve Issue

Resolve Issue

In the Source pane, you identified that in the multiply1 function the code line 39 resulted in the highest values for the Clockticks event. To solve this issue, do the following:
• Change the multiplication algorithm and, if using the Intel® compiler, enable vectorization.
• Re-run the analysis to verify the optimization.

Change Algorithm

NOTE The proposed solution is one of multiple ways to optimize the memory access and is used for demonstration purposes only.

1. Open the matrix.c file from the Source Files of the matrix project. For this sample, the matrix.c file is used to initialize the functions used in the multiply.c file.
2. In line 90, replace the multiply1 function name with the multiply2 function. This new function uses the loop interchange mechanism that optimizes the memory access in the code.
The proposed optimization assumes you may use the Intel® C++ Compiler to build the code. The Intel compiler can vectorize the code, which means that it uses SIMD instructions that work on several data elements simultaneously. If only one source file is used, the Intel compiler enables vectorization automatically. The current sample uses several source files, which is why the multiply2 function uses #pragma ivdep to instruct the compiler to ignore assumed vector dependencies. This information lets the compiler enable the Supplemental Streaming SIMD Extensions 3 (SSSE3).
3. Save the files and rebuild the project using the compiler of your choice. If you have the Intel® Composer XE installed, you may use it to build the project with the Intel® C++ Compiler XE. To do this, select Intel Composer XE > Use Intel C++... from the Visual Studio Project menu and then Build > Rebuild matrix.

Verify Optimization

1. From the VTune Amplifier XE toolbar, click the New Analysis button and select Quick Intel(R) Microarchitecture Code Name Nehalem - General Exploration Analysis. The VTune Amplifier XE reruns the General Exploration analysis for the updated matrix target and creates a new result, r001ge, that opens automatically.
2. In the r001ge result, click the Summary tab to see the Elapsed time value for the optimized code:

You see that the Elapsed time has dropped from 56.740 seconds to 9.122 seconds and the VTune Amplifier XE now identifies only two types of issues for the application performance: high CPI Rate and Retire Stalls.

Recap

You solved the memory access issue for the sample application by interchanging the loops and reduced the execution time. You also considered using the Intel compiler to enable instruction vectorization.
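The source of multiply1 and multiply2 is not reproduced in this tutorial, so the following is only a minimal sketch of what a loop interchange looks like for matrix multiplication. The function names multiply_ijk and multiply_ikj, the matrix size N, and the compiler guard around #pragma ivdep are assumptions made for this example, not the sample's actual code.

```c
#include <stddef.h>

#define N 64  /* hypothetical matrix size for this sketch */

/* i-j-k order: the innermost loop walks b[k][j] down a column, so consecutive
   iterations touch addresses that are N doubles apart. */
void multiply_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

/* i-k-j order (the j and k loops are interchanged): the innermost loop now
   walks b[k][j] and c[i][j] along a row, that is, with unit stride. */
void multiply_ikj(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; ++i)
        for (size_t k = 0; k < N; ++k) {
            /* Only emitted for the Intel compiler: asks it to ignore
               assumed vector dependencies in the loop that follows. */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
            for (size_t j = 0; j < N; ++j)
                c[i][j] += a[i][k] * b[k][j];
        }
}
```

Both functions accumulate into c, so the caller is expected to zero c first. For each element of c the products are added in the same k order in both versions, so the two functions compute the same result; only the memory access pattern differs.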
Key Terms and Concepts

• Concept: Event-based Sampling Analysis

Next Step

Resolve Next Issue

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Resolve Next Issue

You got a significant performance boost by optimizing the memory access for the multiply1 function. According to the data provided in the Summary window for your updated result, r001ge, you still have high CPI Rate and Retire Stalls issues. You can try to optimize your code further by following the steps below:
• Analyze results after optimization
• Use more advanced algorithms
• Verify optimization

Analyze Results after Optimization

To get more details on the issues that still affect the performance of the matrix application, switch to the Bottom-up window:

You see that the multiply2 function (in fact, the updated multiply1 function) is still a hotspot. Double-click this function to view the source code and click both the Source and Assembly buttons on the toolbar to enable the Source and Assembly panes.

In the Source pane, the VTune Amplifier XE highlights line 53 that took the highest number of Clockticks samples.
This is again the section where matrices are multiplied. The Assembly pane is automatically synchronized with the Source pane. It highlights the basic blocks corresponding to the code line highlighted in the Source pane. If you compiled the application with the Intel® Compiler, you can see that the highlighted block 156 includes vectorization instructions added after your previous optimization. All vectorization instructions have the p (packed) postfix (for example, mulpd). You may use the /Qvec-report3 option of the Intel compiler to generate the compiler optimization report and see which loops were not vectorized and why. For more details, see the Intel compiler documentation.

Use More Advanced Algorithms

1. Open the matrix.c file from the Source Files of the matrix project.
2. In line 90, replace the multiply2 function name with the multiply3 function. This function loads the matrix data in blocks.
3. Save the files and rebuild the project.

Verify Optimization

1. From the VTune Amplifier XE toolbar, click the New Analysis button and select Quick Intel(R) Microarchitecture Code Name Nehalem - General Exploration Analysis. The VTune Amplifier XE reruns the General Exploration analysis for the updated matrix target and creates a new result, r002ge, that opens automatically.
2. In the r002ge result, click the Summary tab to see the Elapsed time value for the optimized code:

You see that the Elapsed time has dropped only slightly, from 9.122 seconds to 8.896 seconds, while the hardware issues identified in the previous run, CPI Rate and Retire Stalls, stayed practically the same. This means that there is more room for improvement and you can try other, more effective, mechanisms of matrix multiplication.

Recap

You tried optimizing the mechanism of matrix multiplication and reduced the application execution time by about 0.2 seconds.
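The tutorial does not show the body of multiply3, so the following is only a sketch of the general idea of processing matrix data in blocks (often called tiling). The function name multiply_blocked, the matrix size N, and the tile edge BLOCK are assumptions for this example, and the sketch assumes that BLOCK divides N evenly.

```c
#include <stddef.h>

#define N 64      /* hypothetical matrix size */
#define BLOCK 16  /* hypothetical tile edge; assumed to divide N evenly */

/* Blocked multiplication: the three outer loops select one BLOCK x BLOCK tile
   of each matrix, and the three inner loops multiply those tiles. Working on
   small tiles keeps the data touched by the inner loops close together, which
   is the point of consuming data in chunks that fit in the cache. */
void multiply_blocked(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i0 = 0; i0 < N; i0 += BLOCK)
        for (size_t k0 = 0; k0 < N; k0 += BLOCK)
            for (size_t j0 = 0; j0 < N; j0 += BLOCK)
                for (size_t i = i0; i < i0 + BLOCK; ++i)
                    for (size_t k = k0; k < k0 + BLOCK; ++k)
                        for (size_t j = j0; j < j0 + BLOCK; ++j)
                            c[i][j] += a[i][k] * b[k][j];
}
```

As with the unblocked versions, the function accumulates into c, so the caller zeroes c beforehand. A suitable tile size depends on the cache sizes of the target processor; the value 16 here is a placeholder, not a recommendation.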
Key Terms and Concepts

• Concept: Event-based Sampling Analysis, Event-based Metrics

Next Step

Read Summary

Summary

You have completed the Identifying Hardware Issues tutorial. Here are some important things to remember when using the Intel® VTune™ Amplifier XE to analyze your code for hardware issues:

Step 1. Choose and Build Your Target
• Configure the Microsoft* symbol server and your project properties to get the most accurate results for system and user binaries and to analyze the performance of your application at the code line level.
• Use the Project Properties: Target tab to choose and configure your analysis target. For Visual Studio* projects, the analysis target settings are inherited automatically.

Step 2. Run Analysis
• Use the Analysis Type configuration window to choose, configure, and run the analysis. You may choose a predefined analysis type, like the General Exploration type used in this tutorial, or create a new custom analysis type and add events of your choice. For more details on the custom collection, see the Creating a New Analysis Type topic in the product online help.

Step 3. Interpret Results and Resolve the Issue
• Start analyzing the performance of your application from the Summary window to explore the event-based performance metrics for the whole application. Mouse over the yellow help icons to read the metric descriptions. Use the Elapsed time value as your performance baseline.
• Move to the Bottom-up window and analyze the performance per function. Focus on the hotspots, that is, the functions with the highest Clockticks event count. By default, they are located at the top of the table. Analyze the hardware issues detected for the hotspot functions. Hardware issues are highlighted in pink. Mouse over a highlighted value to read the issue description and see the threshold formula.
• Double-click the hotspot function in the Bottom-up pane to open its source code at the code line that took the highest Clockticks event count.
• Consider using the Intel® Compiler, part of the Intel® Composer XE, to vectorize instructions. Explore the compiler documentation for more details.

More Resources

Getting Help

Intel® VTune™ Amplifier XE provides a number of Getting Started tutorials. These tutorials use a sample application to demonstrate the basic product features and workflows. You can access these documents through the Help menu or by clicking the VTune Amplifier XE icon. From the Visual Studio user interface, select Help > Intel VTune Amplifier XE 2011 > Getting Started Tutorials and explore the available tutorials. For the standalone user interface, the tutorials are available via the Help > Getting Started Tutorials menu.

Browsing Help

In the Visual Studio IDE, you can browse and search for topics in different ways:
• Use Help > Contents to open the Contents window and browse the Table of Contents.
• To view help for the VTune Amplifier XE directly, select Help > Intel VTune Amplifier XE 2011 Help.
• Use Help > Index to open the Index window and access an index to VTune Amplifier XE topics. Either type in the keyword you are looking for, or scroll through the list of keywords.
• Use Help > Search to open the Search page and search the full text of topics in the help.

To view help in the standalone user interface, select Intel VTune Amplifier XE 2011 Help from the Help menu.

Locating Intel Topics in the Document Explorer

To filter the documentation so that only the Intel documentation appears, select Help > Contents from the Visual Studio user interface. In the Filtered by: drop-down list, select Intel. To determine where the currently displayed topic appears in the table of contents (TOC), click the Sync with Table of Contents button on the Visual Studio toolbar to highlight the topic in the Contents pane.

Navigating in the Product Usage Workflow

Where applicable, the VTune Amplifier XE help topics provide a Where am I in the workflow? button. Click the button to view the workflow with a highlight on the stage that this topic discusses.

Activating Intel Search Filters in the Document Explorer

With Microsoft Visual Studio 2005 and 2008, you can include Intel documentation in all search results by checking the Intel search filter box for the Language, Technology, and Content Type categories. You must check the Intel search box for all three categories to include Intel documentation in your searches. Unchecking all three Intel search boxes excludes Intel documentation from search results. The Intel search filters work in combination with other search options for each category.

Using Context-Sensitive Help

Context-sensitive help enables easy access to help topics on active GUI elements. The following context-sensitive help features are available on a product-specific basis:
• ? Help: In Visual Studio, click the ? button in the upper-right corner of the dialog box or pane to get help for the dialog box or pane.
• F1 Help: Press F1 to get help for an active dialog box, property page, pane, or window.
• Dynamic Help: In Visual Studio 2005/2008, select Help > Dynamic Help to open the Dynamic Help window, which displays links to relevant help topics for the current window.

Product Website and Support

The following links provide information and support on Intel software products, including Intel® Parallel Studio XE:
• http://software.intel.com/en-us/articles/tools/ Intel® Software Development Products Knowledge Base.
• http://www.intel.com/software/products/support/ Technical support information, to register your product, or to contact Intel.

For additional support information, see the Technical Support section of your Release Notes.

System Requirements

For detailed information on system requirements, see the Release Notes.

Intel® VTune™ Amplifier XE Tutorials Troubleshooting

Problem: Cannot open samples
The sample projects are Visual Studio* 2005 projects. You may have a problem opening the sample if you have a later version of Visual Studio* software.
Solution: Use the conversion wizard to convert the solution/projects to the newer version.

Problem: Product is not recognized
If you installed a new version of Visual Studio* software, the previously installed Intel® VTune™ Amplifier XE may not appear in the new installation.
Solution 1: If you have the VTune Amplifier XE installation executable file, run the installation program, select Modify, and follow the instructions to reintegrate the VTune Amplifier XE with your new version of Visual Studio* software.
Solution 2:
1. Go to Control Panel > Add or Remove Programs.
2. Select the VTune Amplifier XE and select Modify.
3. Follow the instructions to reintegrate the VTune Amplifier XE with your new version of Visual Studio* software.
Problem: The Project Properties function is disabled
The Intel VTune Amplifier XE 2011 Project Properties option does not appear on the Project menu, and the icon is disabled on the VTune Amplifier XE toolbar.
Solution: Make sure the item highlighted in the Solution Explorer is a valid project recognized by Visual Studio* software or a VTune Amplifier XE result. (The My Amplifier XE Results folder is a virtual project.)

Problem: The Start button is disabled
The Start button on the command toolbar is disabled.
Solution: Make sure you specified an analysis target. If the target is not specified, click the Project Properties button on the command toolbar and enter the target name in the Application to Launch pane. For the General Exploration analysis, the Start button may be disabled if you chose an analysis type that does not match your processor. The selected analysis type should match your processor type.

Intel® VTune™ Amplifier XE 2011 Release Notes for Linux

Installation Guide and Release Notes
Document number: 323591-001US
2 November 2011

Contents:
Introduction
What's New
System Requirements
Technical Support
Installation Notes
Issues and Limitations
Attributions
Disclaimer and Legal Information

1 Introduction

The Intel® VTune™ Amplifier XE 2011 provides an integrated performance analysis and tuning environment with a graphical user interface that helps you analyze code performance on systems with IA-32 or Intel® 64 architectures. This document provides system requirements, installation instructions, issues and limitations, and legal information. The Intel® VTune™ Amplifier XE 2011 has a standalone graphical user interface (GUI) as well as a command-line interface (CLI).

2 What's New

The Intel® VTune™ Amplifier XE 2011 Update 6 adds:
• Intel® Atom™ processors (code name Saltwell and Cedarview) support, including hardware event-based sampling analysis types and metrics for advanced tuning
•
Bandwidth analysis for the 32nm Intel® processors code name Eagleton
• Inline functions support (controlled by a filter bar mode)
• "Tiny" threads timeline mode
• Red Hat* Enterprise Linux 5.7 support
• Bug fixes

The Intel® VTune™ Amplifier XE 2011 Update 5 adds:
• Project Explorer
• Bandwidth Analysis for the 2nd Generation Intel® Core™ Processor Family (codenamed Sandy Bridge)
• Advanced options for analyzing child processes
• Command line reports with stacks
• Support for analysis of MPI programs that use the Intel® MPI library
• Usability improvements
• Newer Linux OS support: Fedora 15, Ubuntu 11.04, Debian 5, MeeGo 1.2 Gold

The Intel® VTune™ Amplifier XE 2011 Update 4:
• Update 3 sometimes incorrectly presented CPU Time in the thread timeline for Hotspots and Concurrency analysis types. Different scales were used for different threads and, thereby, could confuse a user by presenting low CPU Time in one thread as the same height in the chart as high CPU Time in another thread. The values presented in the tool tip when hovering over the chart were still correct. Update 4 resolves this problem completely.
• Debian* 6.0 support
• Ubuntu* 11.04 support

The Intel® VTune™ Amplifier XE 2011 Update 3:
• 32nm Westmere Family of Processors (codenamed Westmere-EX) support
• Pre-defined analysis for Intel® Atom™ Processor
• Attach/detach to process for the Hotspots, Concurrency, and Locks and Waits analysis types
• Comparison mode in Summary pane

The Intel® VTune™ Amplifier XE 2011 Update 2:
• The 2nd Generation Intel® Core™ Processor Family (codenamed Sandy Bridge) support including EBS based analysis types and metrics for advanced tuning
• Fedora* 14 support
• Automatic highlighting and expansion in Bottom-Up and Top-Down panes
• Tooltips for metrics description in grid panes
• Ability to import tb5/6 files from GUI
•
JIT API support for Hotspots, Concurrency, and Locks and Waits analysis types
• Overhead time metric calculation for native threading synchronization

The Intel® VTune™ Amplifier XE 2011 Update 1:
• Red Hat* Enterprise Linux 6 support
• CentOS* 5.5 support
• Ubuntu* 10.04 support
• Data export to CSV file format
• Source / assembly toggling button
• Several bugs were fixed.

3 System Requirements

For an explanation of architecture names, see http://software.intel.com/en-us/articles/intel-architecture-platform-terminology/

Processor requirements
• For general operations with user interface and all data collection except Hardware event-based sampling analysis
o A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium 4 processor or later, or compatible non-Intel processor).
o For the best experience, a multi-core or multi-processor system is recommended.
o Because Intel® VTune™ Amplifier XE requires specific knowledge of assembly-level instructions, its analysis may not operate correctly if a program contains non-Intel® instructions. In this case, run the analysis with a target executable that contains only Intel instructions. After you finish using VTune™ Amplifier XE, you can use the assembler code or optimizing compiler options that provide the non-Intel instructions.
• For Hardware event-based sampling analysis (EBS)
o EBS analysis makes use of the on-chip Performance Monitoring Unit (PMU) and requires a genuine Intel processor for collection. EBS analysis is supported on Intel® Pentium® M, Intel® Core™ microarchitecture and newer processors (for more precise details, see the list below).
o EBS analysis is not supported on the Intel® Pentium 4 processor family (Intel® NetBurst® MicroArchitecture) and non-Intel processors.
o However, the results collected with EBS can be analyzed using any system meeting the less restrictive general operation requirements.
o EBS analysis requires a non-virtual machine to ensure access to the on-chip PMU. EBS is not supported within a virtual machine environment.
• The list of supported processors is constantly being extended. Here is a partial list of processors where the EBS analysis is enabled:

Mobile processors
Intel® Atom™ Processor
Intel® Core™ i7 Mobile Processor Extreme Edition
Intel® Core™ i7, i5, i3 Mobile Processors
Intel® Core™2 Extreme Mobile Processor
Intel® Core™2 Quad Mobile Processor
Intel® Core™2 Duo Mobile Processor
Intel® Core™ Duo Processor
Intel® Core™ Solo Processor
Intel® Pentium® Mobile Processor

Desktop processors
Intel® Atom™ Processor
Intel® Core™ i7 Desktop Processor Extreme Edition
Intel® Core™ i7, i5, i3 Desktop Processors
Intel® Core™2 Quad Desktop Processor
Intel® Core™2 Extreme Desktop Processor
Intel® Core™2 Duo Desktop Processor

Server and workstation processors
Intel® Xeon® processors E7-8800/4800/2800 family
Intel® Xeon® processors E3-1200 family
Intel® Xeon® processors 65xx/75xx series
Intel® Xeon® processors 36xx/56xx series
Intel® Xeon® processors 35xx/55xx series
Intel® Xeon® processors 34xx series
Quad-Core Intel® Xeon® processors 7xxx, 5xxx, and 3xxx series
Dual-Core Intel® Xeon® processors 7xxx, 5xxx, and 3xxx series

System Memory Requirements
• At least 2 GB of RAM

Disk Space Requirements
• 280 MB free disk space required for all product features and all architectures

Software Requirements
•
Supported Linux* distributions: o Red Hat* Enterprise Linux 4 (starting from Update 8) o Red Hat* Enterprise Linux 5 and 6 o CentOS* versions equivalent to Red Hat* Enterprise Linux* versions listed above o SUSE* Linux* Enterprise Server (SLES) 10 and 11 o Fedora* 14 and 15 o Ubuntu* 10.04, 10.10 † and 11.04 † o Debian* 5.0 and 6.0 o MeeGo* 1.1 and MeeGo* 1.2 Gold †† † VTune™ Amplifier XE supports Ubuntu* 10.10 and Ubuntu* 11.04 default configuration only for event-based sampling analysis in the command line mode. To learn how to enable all other types of analysis and GUI results, please see the solutions described in the Known Limitation section, items 200197559, 200197563, of this document. †† Please refer to the Intel® AppUp™ SDK Suite for MeeGo* documentation for more information. ? We support all OS distributions above. For your information, VTune™ Amplifier XE was qualified on the builds listed below: o Red Hat* Enterprise Linux 4 Update 8 o Red Hat* Enterprise Linux 5 Update 6 and 7 o SUSE* Linux Enterprise Server 10 Service Pack 4 o SUSE* Linux Enterprise Server 11 Service Pack 1 o Fedora* 14 and 15 o Ubuntu* 10.04 and 11.04 o Debian* 5.0 and 6.0 ? Supported compilers: o Intel® C/C++ Compiler 11 and higher o Intel® Fortran Compiler 11 and higher o GNU C/C++ Compiler 3.4.6 and higher ? Application coding requirements Intel® VTune™ Amplifier XE 2011 Release Notes 6 o Supported programming languages: ? Fortran ? C ? C++ o Concurrency and Locks and Waits analysis types interpret the use of constructs from the following threading methodologies: ? Intel® Threading Building Blocks ? Posix* Threads on Linux* ? OpenMP*[1] ? Intel's C/C++ Parallel Language Extensions ? To view PDF documents, use a PDF reader, such as Adobe Reader*. Notes: 1. VTune™ Amplifier XE supports analysis of applications built with Intel® Fortran Compiler Professional Edition version 11.0 or higher, Intel® C++ Compiler Professional Edition version 11.0 or higher, or GNU C/C++ Compiler 3.4.6. 
Applications that use OpenMP* technology and are built with the GNU compiler must link to the OpenMP* compatibility library as supplied by an Intel® compiler.

4 Technical Support

If you did not register your product during installation, please do so at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates and upgrades for the duration of the support term.

For information about how to find Technical Support, Product Updates, User Forums, FAQs, tips and tricks, and other support information, please visit http://www.intel.com/software/products/support/

Note: If your distributor provides technical support for this product, please contact them for support rather than Intel.

5 Installation Notes

If you are installing the product for the first time, please be sure to have the product serial number available so you can type it in during installation. A valid license is required for installation and use.

This product package can be used to install the software on both IA-32 systems and Intel® 64 systems. The installer determines the system architecture and installs the appropriate files. Both 32-bit and 64-bit versions of the software are automatically installed on an Intel® 64 system.

To begin installation, do the following:
1. gunzip and untar to retrieve the installation packages.
2. Execute the ./install.sh script file (available at the top level in the untarred contents) as a root user. Activation is required.

Note:
1. To install all components to a network-mounted drive or shared file system, execute the following command in place of the one in step 2 above:
   ./install.sh --SHARED_INSTALL
2. The install can be run as a non-root user, but in this case not all collectors will be available to the user.
3. For successful installation you should have read and write permissions for the /tmp directory.
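
The two conditions in the notes above (root versus non-root install, and read/write access to /tmp) can be checked before launching the installer. The following is a minimal sketch, not part of the product: the function name preflight_check and its messages are hypothetical, and it tests only those two conditions.

```shell
#!/bin/sh
# Hypothetical pre-install check; this helper is NOT shipped with the product.
# Usage: preflight_check [tmp_dir]   (tmp_dir defaults to /tmp)
preflight_check() {
    tmp_dir="${1:-/tmp}"

    # Note 3: the installer needs read and write permissions for /tmp.
    if [ ! -d "$tmp_dir" ] || [ ! -r "$tmp_dir" ] || [ ! -w "$tmp_dir" ]; then
        echo "FAIL: no read/write access to $tmp_dir"
        return 1
    fi

    # Note 2: a non-root install works, but not all collectors are available.
    if [ "$(id -u)" -ne 0 ]; then
        echo "WARN: not running as root; some collectors will be unavailable"
    fi

    echo "OK: $tmp_dir is readable and writable"
    return 0
}
```

Calling preflight_check with no argument checks /tmp; a non-zero return status means the installer is likely to fail on the temporary directory.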
Installing Collectors on Remote Systems

You can install the command line data collection features of the product on remote systems to reduce overhead and simply collect data remotely. Data collection on a remote system does not require a license; however, viewing of the data cannot be done on the remote system unless a license is present. The results of any data collection that is run on the remote system must then be copied to the system where the regular install was done for analysis, viewing, and reporting. To do this:
1. Copy the CLI_install folder (found at the top level in the untarred product install package) to the remote machine.
2. Execute the ./install.sh script file (this file is located inside the CLI_install folder). No activation will be required.

Default Installation Directories

The default top-level installation directory for this product is:
• /opt/intel/vtune_amplifier_xe_2011/

This product installs into an arrangement of directories shown in the diagram below. Not all directories will be present in a given installation.
• /opt/intel/vtune_amplifier_xe_2011/
  o bin32
  o bin64*
  o config
  o documentation
  o include
  o lib32
  o lib64*
  o man
  o message
  o resources
  o sepdk
  o samples
(*) bin64 and lib64 are available for Intel® 64 architecture install package

Establishing the VTune™ Amplifier XE Environment

The amplxe-vars.sh script is used to establish the VTune™ Amplifier XE environment. The command takes the form:
source /amplxe-vars.sh

Advanced Installation Options

VTune™ Amplifier XE uses a kernel driver to enable event-based sampling (EBS) analysis. If you are not using a default kernel on the supported Linux* distributions listed above, use the SEP Driver Kit in VTune™ Amplifier XE to compile drivers for your kernel. If no pre-built drivers are provided for your kernel, VTune™ Amplifier XE installer will automatically use the SEP Driver Kit to try and build a driver for your kernel.
The driver can also be built manually after the product is installed using the SEP Driver Kit.

Note: additional software may be needed in order to build and load the SEP kernel driver on the Linux* operating system. For details, see the README.txt file in the sepdk/src directory.

When the Advanced installation is chosen, the following options are available:
• Driver install type [ use pre-built driver (default) / build driver / driver kit files only ]
  If no pre-built driver for this system is found, the option will be set to 'build driver'. You may change the option to 'driver kit files only' if you don't want to build/install the driver or want to do it manually after installation.
• Driver access group [ vtune (default) ]
  Setting the driver access group ownership is a security feature and is used to control access to the kernel module. By default the group for accessing the driver is “vtune”. You may set your own group during installation or change it manually after installation by executing './boot-script --group ' from the sepdk/src directory.
• Load driver [ yes (default) ]
  By default installation loads the driver into the kernel.
• Install boot script [ yes (default) ]
  By default installation sets up a boot script which loads the driver into the kernel each time the system is rebooted. The boot script can be disabled later by executing './boot-script --uninstall' from the sepdk/src directory.

How to activate your evaluation software after purchasing

Users of evaluation versions of Intel Developer Products have a new tool that allows converting evaluation-licensed products to fully licensed products once the product is purchased and a serial number is obtained. The “Activation Tool” is a utility that allows users of evaluation products to enter a valid product Serial Number to convert the product to fully licensed status.
Run the /opt/intel/ActivationTool/Activate script, and provide your purchased product serial number, either as an argument to the program, or when prompted. For example:
/opt/intel/ActivationTool/Activate ABCD-123AB45C
Be sure to log in or “su” to root if you want the product license to be available to all system users.

Removing the Product

If you want to remove components from an installation, run the uninstall.sh script as root user from the product installation folder.

6 Issues and Limitations

Known Issues and Limitations
• Running time is attributed to a next instruction (200108041)
  o To collect the data about time-consuming running regions of the target, the VTune™ Amplifier XE interrupts executing target threads and attributes the time to the context IP address.
  o Due to the collection mechanism, the captured IP address points to the instruction that occurs AFTER the one that is actually consuming most of the time. This leads to the running time being attributed to the next instruction (or, rarely, to one of the subsequent instructions) in the Assembly view. In rare cases, this can also lead to wrong attribution of running time in the source: the time may be erroneously attributed to the source line AFTER the actual hot line.
  o If the inline mode is ON and the program has small functions inlined at the hotspots, this can cause the running time to be attributed to the wrong function, since the next instruction can belong to a different function in tightly inlined code.
• An application which allocates massive chunks of memory may fail to work under Amplifier (200083850)
  o If a 32-bit application allocates massive chunks of memory (close to 2 GB) in the heap, it may fail to launch under Amplifier while running fine on its own. This happens because Amplifier requires additional memory in the profiled application process for doing the analysis. A possible workaround is to use a larger address space (e.g.
    converting the project to 64-bit).
• SEP may crash certain NHM systems when deep sleep states are enabled (200149603)
  o On some Intel® Core™ i7 processor-based systems with C-states enabled, sampling may cause the system to hang due to a known hardware issue (see errata AAJ134 in http://download.intel.com/design/processor/specupdt/320836.pdf). To avoid this, disable the “Cn(ACPI Cn) report to OS” BIOS option before sampling with the VTune Amplifier XE analyzer on Intel Core™ i7 processor-based systems.
• Link to instruction guide: instruction set reference document is not positioned on description of proper instruction. (200091200)
  o The reference information for assembly instructions can be opened in any PDF viewer, but only Adobe Acrobat Reader* supports positioning the instruction reference document on the required page. To ensure correct functionality of this feature, it is recommended that you install the latest available version of Adobe Acrobat Reader.
• Specifying too low a "Sampling After Value" for some events may cause a system hang due to frequent event triggering during the collection (200093394)
  o Use a reasonable "Sampling After Value" that results in about 1000 events triggering per second. This is statistically sufficient for the data analysis. For more fine-grained analysis of sampling results, decrease the "Sampling After Value" gradually while observing the system responsiveness slowdown due to frequent interruptions.
• Security-enhanced Linux* is not supported (200155374)
  o Security-enhanced Linux* settings (SELinux) are currently not supported by the Intel® VTune™ Amplifier XE and need to be either disabled or set to permissive for a successful tool suite installation. If your Linux* distribution has SELinux enabled, the following error message will be issued by the installer:
  o Your system is protected with Security-enhanced Linux (SELinux).
    We currently support only "Permissive" mode, which is not found on the system. To rectify this issue, you may either disable SELinux by
    - setting the line "SELINUX=disabled" in your /etc/sysconfig/selinux file
    - adding "selinux=0" kernel argument in lilo.conf or grub.conf files
    or make SELinux mode adjustment by
    - setting the line "SELINUX=permissive" in your /etc/sysconfig/selinux file
    or ask your system administrator to make SELinux mode adjustment. You may need to reboot your system after changing the system parameters.
    More information about SELinux can be found at http://www.nsa.gov/selinux/
• The tool may not be able to correctly parse certain characters in an application's command arguments passed through a shell script (200155871)
  o Quotes and double quotes in the application's command arguments may not be parsed correctly. To work around the problem, use double quotes and backslashes to escape double quotes inside.
  o Incorrect: 'this "style" text'
  o Correct: "this \"style\" text"
• Event-based sampling collection cannot start if the result directory path contains non-English characters (200185851)
  o When you install the product on a system with language localization, make sure the path to the result directory does not contain non-English characters.
• On Ubuntu* 10.10 systems, Standalone GUI silently disappears when opening the results. (200197559)
  o Recommendation: Switch the visual theme to "New wave" or switch to another window manager (e.g. KDE).
• Intel(R) VTune Amplifier XE collectors may fail to run on Ubuntu 10.10 and Ubuntu 11.04 (200197563)
  o Intel(R) VTune Amplifier XE may fail to collect data for Hotspots, Concurrency, and Locks and Waits analysis types on the Ubuntu 10.10 and Ubuntu 11.04 operating systems. Once a collection is started, the following message appears in the output:
    “Failed to start profiling because the scope of ptrace() system call application is limited.
    To enable profiling, please set /proc/sys/kernel/yama/ptrace_scope to 0. See the Release Notes for instructions on enabling it permanently.”
  o To work around this problem for the current session, set the /proc/sys/kernel/yama/ptrace_scope sysctl to 0.
  o To make this change permanent, set the kernel.yama.ptrace_scope value to 0 in the /etc/sysctl.d/10-ptrace.conf file using root permissions and reboot the machine.
• VTune™ Amplifier XE may be killed while opening results on Ubuntu 10.10 or later if no license is provided (200197888)
  o This happens due to checking a license with enabled trusted storage. A possible workaround is disabling the ptrace protection in the OS by using the command:
    echo 0 | tee /proc/sys/kernel/yama/ptrace_scope
  o However, normally it's expected that a license is provided for the product before using it.
• VTune Amplifier XE collectors may crash or produce corrupted data while profiling stripped binaries. (200165647)
  o VTune Amplifier XE may fail to collect data for Hotspots, Concurrency, and Locks and Waits analysis types if the main executable of an analysis target statically links some symbols from libc.so or libpthread.so (for example, pthread_create). To avoid this, do not strip the main executable. Use the -E linker switch to export the statically linked symbols to the dynamic symbol table of the main executable. For the list of symbols required for correct profiling, see the Analyzing Statically Linked Libraries topic in the online help.
• Hotspots, Concurrency and Locks and Waits analysis types may not work on executables that do not depend on the libpthread.so.0 library. (200208975)
  o There is currently a limitation in the product regarding profiling application targets where the executable does not depend on the libpthread.so.0 library.
    The message
      Link libpthread.so to the application statically and restart profiling
    appears when profiling an application whose program image does not depend on libpthread.so.0 but which then dlopen()-s a shared library that does depend on libpthread.so.0. The collector is not able to follow the program execution and module load/unload, so the collection results are likely to be misleading.
  o A workaround is to set "LD_PRELOAD=libpthread.so.0" before running the collection.
• VTune Amplifier XE collectors may crash on Red Hat Enterprise Linux x64 system while re-attaching to a process. (200212086)
  o VTune Amplifier XE may fail to collect data for Hotspots, Concurrency, and Locks and Waits analysis types if attempting to attach to a 64-bit process on a RHEL6 system after detaching from the same process.
• Event-based profiling results may be incorrect if nmi_watchdog interrupt capability is enabled (200171859)
  o If the nmi_watchdog interrupt capability is enabled on a Linux system, event-based profiling results may be incorrect. For example, when using a pause-resume scenario for event-based analysis on 64-bit Red Hat* Enterprise Linux* 6.1 with this feature enabled, no data will be collected after the collection is resumed. Before running event-based analysis on Linux systems, ensure that the nmi_watchdog interrupt capability, if available, is disabled. Disabling the nmi_watchdog interrupt is accomplished by adding the Linux kernel boot parameter 'nmi_watchdog=0' to your system boot loader and then rebooting the system.
• Information collected via ITT API is not available when attaching to a process. (200172007)
  o When collecting statistics data using ITT API calls inserted into the source code, such as Frame Analysis or JIT-profiling, attaching to a process will not bring the expected results. Use the VTune Amplifier XE analysis to start an application instead of attaching to a process.
• Do not use the -ipo option: it causes the inline debug information to switch off (200260765)
  o If using the Intel® compiler to get performance data on inline functions, use the additional option “-inline-debug-info”, but avoid using the -ipo option. Currently this option disables generating the inline debug information in the compiler.
• Intel® Compiler currently doesn't support function split ranges in debug info, which may lead to wrong performance data attribution in case function ranges overlap (e.g. performance data attributed to one function, but should have been split between two). (200260768)
  o In some cases the Intel® Compiler generates imprecise debug information about ranges of inline functions. This may lead to wrong performance data attribution when the Inline mode is turned on, for example: instead of two functions, performance data is attributed to just one of them.

7 Attributions

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. 
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. 
END OF TERMS AND CONDITIONS

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Libxml2

Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar license but with different Copyright notices) all the files are:

Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.

Libunwind

Copyright (c) 2002 Hewlett-Packard Co.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar licence but with different Copyright notices) all the files are: Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him. 
PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2

1. This LICENSE AGREEMENT is between the Python Software Foundation ("PSF"), and the Individual or Organization ("Licensee") accessing and otherwise using this software ("Python") in source or binary form and its associated documentation.

2. Subject to the terms and conditions of this License Agreement, PSF hereby grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce, analyze, test, perform and/or display publicly, prepare derivative works, distribute, and otherwise use Python alone or in any derivative version, provided, however, that PSF's License Agreement and PSF's notice of copyright, i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008 Python Software Foundation; All Rights Reserved" are retained in Python alone or in any derivative version prepared by Licensee.

3. In the event Licensee prepares a derivative work that is based on or incorporates Python or any part thereof, and wants to make the derivative work available to others as provided herein, then Licensee hereby agrees to include in any such work a brief summary of the changes made to Python.

4. PSF is making Python available to Licensee on an "AS IS" basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT INFRINGE ANY THIRD PARTY RIGHTS.

5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON, OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.

6. This License Agreement will automatically terminate upon a material breach of its terms and conditions.

7. Nothing in this License Agreement shall be deemed to create any relationship of agency, partnership, or joint venture between PSF and Licensee. This License Agreement does not grant permission to use PSF trademarks or trade name in a trademark sense to endorse or promote products or services of Licensee, or any third party.

8. By copying, installing or otherwise using Python, Licensee agrees to be bound by the terms and conditions of this License Agreement.

wxWidgets Library

This product includes wxWindows software which can be downloaded from www.wxwidgets.org/downloads.

wxWindows Library Licence, Version 3.1
======================================

Copyright (C) 1998-2005 Julian Smart, Robert Roebling et al

Everyone is permitted to copy and distribute verbatim copies of this licence document, but changing it is not allowed.

WXWINDOWS LIBRARY LICENCE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public Licence as published by the Free Software Foundation; either version 2 of the Licence, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public Licence for more details.

You should have received a copy of the GNU Library General Public Licence along with this software, usually in a file named COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

EXCEPTION NOTICE

1. As a special exception, the copyright holders of this library give permission for additional uses of the text contained in this release of the library as licenced under the wxWindows Library Licence, applying either version 3.1 of the Licence, or (at your option) any later version of the Licence as published by the copyright holders of version 3.1 of the Licence document.

2. The exception is that you may use, copy, link, modify and distribute under your own terms, binary object code versions of works based on the Library.

3. If you copy code from files distributed under the terms of the GNU General Public Licence or the GNU Library General Public Licence into a copy of this library, as this licence permits, the exception does not apply to the code that you add in this way. To avoid misleading anyone as to the status of such modified files, you must delete this exception notice from such code and/or adjust the licensing conditions notice accordingly.

4. If you write modifications of your own for this library, it is your choice whether to permit this exception to apply to your modifications. If you do not wish that, you must delete the exception notice from such code and/or adjust the licensing conditions notice accordingly.

/* zlib.h -- interface of the 'zlib' general purpose compression library
version 1.2.3, July 18th, 2005

Copyright (C) 1995-2005 Jean-loup Gailly and Mark Adler

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software.
If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution. Jean-loup Gailly jloup@gzip.org Mark Adler madler@alumni.caltech.edu */ 8 Disclaimer and Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. 
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

This document contains information on products in the design phase of development.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Core Inside, i960, Intel, the Intel logo, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java and all Java based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

Copyright (C) 2010-2011, Intel Corporation. All rights reserved.

Intel® VTune™ Amplifier XE 2011 Release Notes for Windows* OS
Installation Guide and Release Notes
Document number: 323401-001US
2 November 2011

Contents:
Introduction
What’s New
System Requirements
Technical Support
Installation Notes
Issues and Limitations
Attributions
Disclaimer and Legal Information

1 Introduction

The Intel® VTune™ Amplifier XE 2011 provides an integrated performance analysis and tuning environment with a graphical user interface that helps you analyze code performance on systems with IA-32 or Intel® 64 architectures.

This document provides system requirements, installation instructions, issues and limitations, and legal information.

The Intel® VTune™ Amplifier XE 2011 has a standalone graphical user interface (GUI) as well as a command-line interface (CLI).

To learn more about this product’s documentation, help, and samples, see the Intel® VTune™ Amplifier XE 2011 Documentation item in the Start menu program folder.

2 What’s New

The Intel® VTune™ Amplifier XE 2011 Update 6 adds:
• Intel® Atom™ processors (code name Saltwell and Cedarview) support, including hardware event-based sampling analysis types and metrics for advanced tuning
• Bandwidth analysis for the 32nm Intel® processors code name Eagleton
• Inline functions support (controlled by a filter bar mode)
• “Tiny” threads timeline mode
• Bug fixes

The Intel® VTune™ Amplifier XE 2011 Update 5 adds:
• Project Explorer
• Bandwidth Analysis for the 2nd Generation Intel® Core™ Processor Family (codenamed Sandy Bridge)
• Advanced options for analyzing child processes
• Command line reports with stacks
• Support for analysis of MPI programs that use the Intel® MPI library
• Usability improvements

The Intel® VTune™ Amplifier XE 2011 Update 4:
• Update 3 sometimes incorrectly presented CPU Time in the thread timeline for Hotspots and Concurrency analysis types. Different scales were used for different threads, so low CPU Time in one thread could appear at the same height in the chart as high CPU Time in another thread. The values presented in the tool tip when hovering over the chart were still correct. Update 4 resolves this problem completely.

The Intel® VTune™ Amplifier XE 2011 Update 3:
• 32nm Westmere Family of Processors (codenamed Westmere-EX) support
• Pre-defined analysis for Intel® Atom™ Processor
• Comparison mode in Summary pane

The Intel® VTune™ Amplifier XE 2011 Update 2:
• The 2nd Generation Intel® Core™ Processor Family (codenamed Sandy Bridge) support including EBS based analysis types and metrics for advanced tuning
• Automatic highlighting and expansion in Bottom-Up and Top-Down panes
• Tooltips for metrics description in grid panes
• Ability to import tb5/6 files from GUI
• JIT API support for Hotspots, Concurrency, and Locks and Waits analysis types
• Overhead time metric calculation for native threading synchronization

The Intel® VTune™ Amplifier XE 2011 Update 1:
• Data export to CSV file format
• Source / assembly toggling button
• Several bugs were fixed.

3 System Requirements

For an explanation of architecture names, see http://software.intel.com/en-us/articles/intelarchitecture-platform-terminology/

Processor requirements
• For general operations with user interface and all data collection except Hardware event-based sampling analysis
    o A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium 4 processor or later, or compatible non-Intel processor).
    o For the best experience, a multi-core or multi-processor system is recommended.
    o Because Intel® VTune™ Amplifier XE requires specific knowledge of assembly-level instructions, its analysis may not operate correctly if a program contains non-Intel® instructions. In this case, run the analysis with a target executable that contains only Intel instructions. After you finish using VTune™ Amplifier XE, you can use the assembler code or optimizing compiler options that provide the non-Intel instructions.
• For Hardware event-based sampling analysis (EBS)
    o EBS analysis makes use of the on-chip Performance Monitoring Unit (PMU) and requires a genuine Intel processor for collection. EBS analysis is supported on Intel® Pentium® M, Intel® Core™ microarchitecture and newer processors (for more precise details, see the list below).
    o EBS analysis is not supported on the Intel® Pentium 4 processor family (Intel® NetBurst® MicroArchitecture) and non-Intel processors.
    o However, the results collected with EBS can be analyzed using any system meeting the less restrictive general operation requirements.
    o EBS analysis requires a non-virtual machine to ensure access to the on-chip PMU. EBS is not supported within a virtual machine environment.
• The list of supported processors is constantly being extended. Here is a partial list of processors where the EBS analysis is enabled:

Mobile processors
    Intel® Atom™ Processor
    Intel® Core™ i7 Mobile Processor Extreme Edition
    Intel® Core™ i7, i5, i3 Mobile Processors
    Intel® Core™2 Extreme Mobile Processor
    Intel® Core™2 Quad Mobile Processor
    Intel® Core™2 Duo Mobile Processor
    Intel® Core™ Duo Processor
    Intel® Core™ Solo Processor
    Intel® Pentium® Mobile Processor

Desktop processors
    Intel® Atom™ Processor
    Intel® Core™ i7 Desktop Processor Extreme Edition
    Intel® Core™ i7, i5, i3 Desktop Processors
    Intel® Core™2 Quad Desktop Processor
    Intel® Core™2 Extreme Desktop Processor
    Intel® Core™2 Duo Desktop Processor

Server and workstation processors
    Intel® Xeon® processors E7-8800/4800/2800 family
    Intel® Xeon® processors E3-1200 family
    Intel® Xeon® processors 65xx/75xx series
    Intel® Xeon® processors 36xx/56xx series
    Intel® Xeon® processors 35xx/55xx series
    Intel® Xeon® processors 34xx series
    Quad-Core Intel® Xeon® processors 7xxx, 5xxx, and 3xxx series
    Dual-Core Intel® Xeon® processors 7xxx, 5xxx, and 3xxx series

System Memory Requirements
• At least 2 GB of RAM

Disk Space Requirements
• 650 MB free disk space required for all product features and all architectures

Software Requirements
• Supported operating systems:
    o Microsoft* Windows XP* SP2 and SP3
    o Microsoft* Windows XP Professional x64 Edition SP1 and SP2
    o Microsoft* Windows Vista* (Ultimate)
    o Microsoft* Windows 7* SP1
    o Microsoft* Windows Server 2008*
    o Embedded editions not supported
  NOTE: In a future major release of this product, support for installation and use on Microsoft Windows Vista will be removed.
• All operating systems listed above are supported. For your information, VTune™ Amplifier XE was qualified on the systems listed below:
    o Microsoft* Windows XP* SP2 and SP3
    o Microsoft* Windows Vista* (Ultimate) SP1 and SP2
    o Microsoft* Windows Server 2008* and SP2
    o Microsoft* Windows Server 2008* R2
    o Microsoft* Windows 7* and SP1
• Supported compilers:
    o Intel® C/C++ Compiler 11 and higher
    o Intel® Fortran Compiler 11 and higher
    o Intel Parallel Composer
    o Microsoft* Visual Studio* C/C++ Compiler
• Supported Microsoft Visual Studio versions:
    o Microsoft* Visual Studio* 2005
    o Microsoft* Visual Studio* 2008
    o Microsoft* Visual Studio* 2010 and SP1
  NOTE: In a future major release of this product, support for installation and use with Microsoft Visual Studio 2005 will be removed. Intel recommends that customers migrate to Microsoft Visual Studio 2010* at their earliest convenience.
• Application coding requirements
    o Supported programming languages:
        - Fortran
        - C
        - C++
        - C# (only .NET versions 4.0 and below are supported)
    o Concurrency and Locks and Waits analysis types interpret the use of constructs from the following threading methodologies:
        - Intel® Threading Building Blocks
        - Win32* Threads on Windows*
        - OpenMP*
        - Intel's C/C++ Parallel Language Extensions
• To view PDF documents, use a PDF reader, such as Adobe Reader*.

4 Technical Support

If you did not register your product during installation, please do so at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates and upgrades for the duration of the support term.

For information about how to find Technical Support, Product Updates, User Forums, FAQs, tips and tricks, and other support information, please visit http://www.intel.com/software/products/support/

Note: If your distributor provides technical support for this product, please contact them for support rather than Intel.

5 Installation Notes

If you are installing the product for the first time, please be sure to have the product serial number available so you can type it in during installation. A valid license is required for installation and use.

The installation of VTune™ Amplifier XE removes any earlier installed version of VTune™ Amplifier XE.
The product is a self-extracting executable archive with one IA-32 package that you can install on either a 32-bit or 64-bit system.

To begin installation, double-click the VTune_Amplifier_XE_2011_update6_setup.exe file as a user with Administrative privileges. This installs the full package (including the GUI front-end for using the VTune™ Amplifier XE as well as Microsoft* Visual Studio integration). Activation is required.

Installing Collectors on Remote Systems

You can install the command line data collection features of the product on remote systems to reduce overhead and simply collect data remotely. Data collection on a remote system does not require a license; however, the data cannot be viewed on the remote system unless a license is present. The results of any data collection that is run on the remote system must then be copied to the system where the regular install was done for analysis, viewing, and reporting.

To do this:

1. Unpack the product web image manually using the command:

    VTune_Amplifier_XE_2011_update6_setup.exe --extract-only --silent --extract-folder C:\temp\AmplXE_update6_unpacked

   Use any convenient path for the --extract-folder option. If the --extract-folder option is omitted, the default location for the extracted image is "C:\Program Files (x86)\Intel\Download\VTune_Amplifier_XE_2011_update6_setup" for a 64-bit OS and "C:\Program Files\Intel\Download\VTune_Amplifier_XE_2011_update6_setup" for a 32-bit OS.

2. Copy the folder containing the installation files for the collectors and command line tools to the remote machine. With the example shown above, the location of this folder would be C:\temp\AmplXE_update6_unpacked\Installs\ps_he_cli.*

3. Run the Amplifier_XE.msi with Administrative privileges and follow the instructions. No activation will be required.

4. On a 64-bit remote machine, from the VTune™ Amplifier XE installation location, run and install msvcrt_x86.msi and msvcrt_x64.msi (requires Administrative privileges).

5. On a 32-bit remote machine, from the VTune™ Amplifier XE installation location, run and install msvcrt_x86.msi (requires Administrative privileges).

Default Installation Folders

The default top-level installation folder for this product is:
• C:\Program Files\Intel\VTune Amplifier XE 2011\

If you are installing on a system with a non-English language version of Windows, the name of the Program Files folder may be different. On Intel® 64 architecture systems, the folder name is Program Files (X86) or the equivalent.

This product installs into an arrangement of folders shown in the diagram below. Not all folders will be present in a given installation.
• C:\Program Files\Intel\Amplifier XE 2011\
    o bin32
    o bin64*
    o config
    o documentation
    o include
    o lib32
    o lib64*
    o message
    o resources
    o sepdk
    o samples
(*) bin64 and lib64 are available for Intel® 64 architecture install package

How to activate your evaluation software after purchasing

Users of evaluation versions of Intel Developer Products have a new tool that allows converting evaluation-licensed products to fully licensed products once the product is purchased and a serial number is obtained. The “Activation Tool” is a utility that allows users of evaluation products to enter a valid product Serial Number to convert the product to fully licensed status.

Please click Start > All Programs > Intel Parallel Studio XE 2011 > Product Activation, supply a valid product serial number, and click Activate to convert your evaluation software to a fully licensed product.

Changing, Updating and Removing the Product

If you want to add or remove components from an installation, open the Control Panel and select the Add or Remove Programs applet, select “Intel® VTune™ Amplifier XE 2011” and click Change.
To remove the product, select Remove instead of Change.

When installing an updated version of the product, you do not need to remove the older version. The installation program removes the old version automatically.

Note: If the SEP driver uninstallation failed during the normal uninstall process, open a Command Prompt window and execute the following commands with Administrative privileges to manually remove the SEP driver from the system:

    cd %windir%\system32\drivers
    dir sep*.sys
    rem unload SEP3 driver from kernel
    net stop sep3_4
    rem delete SEP3 driver from filesystem
    del sep3_4.sys
    rem unload PAX driver from kernel
    net stop sepdal
    rem delete PAX driver from filesystem
    del sepdal.sys

6 Issues and Limitations

Known Issues and Limitations

• Running time is attributed to the next instruction (200108041)
    o To collect the data about time-consuming running regions of the target, the VTune™ Amplifier XE interrupts executing target threads and attributes the time to the context IP address.
    o Due to the collection mechanism, the captured IP address points to the instruction that occurs AFTER the one that is actually consuming most of the time. As a result, the running time is attributed to the next instruction (or, rarely, to one of the subsequent instructions) in the Assembly view. In rare cases, this can also lead to wrong attribution of running time in the source: the time may be erroneously attributed to the source line AFTER the actual hot line.
    o If the inline mode is ON and the program has small functions inlined at the hotspots, the running time can be attributed to the wrong function, since the next instruction can belong to a different function in tightly inlined code.
• Incorrect timing results when running on a 32-bit virtual machine (200137061)
    o Intel® Amplifier may fail to collect correct timing data when running on a virtual machine with problematic virtualization of time stamp counters. In this case Amplifier throws a warning message:
      “Warning: Cannot load data file '.trace' (syncAcquiredHandler: timestamps aren't ascended!)”
• An application which allocates massive chunks of memory may fail to work under Amplifier (200083850)
    o If a 32-bit application allocates massive chunks of memory (close to 2 GB) in the heap, it may fail to launch under Amplifier while running fine on its own. This happens because Amplifier requires additional memory in the profiled application process for doing the analysis. A possible workaround is to use a larger address space (e.g. by converting the project to 64-bit).
• SEP may crash certain NHM systems when deep sleep states are enabled (200149603)
    o On some Intel® Core™ i7 processor-based systems with C-states enabled, sampling may cause the system to hang due to a known hardware issue (see errata AAJ134 in http://download.intel.com/design/processor/specupdt/320836.pdf). To avoid this, disable the “Cn(ACPI Cn) report to OS” BIOS option before sampling with the VTune Amplifier XE analyzer on Intel Core™ i7 processor-based systems.
• Link to instruction guide: instruction set reference document is not positioned on description of proper instruction. (200091200)
    o The reference information for assembly instructions can be opened in any PDF viewer, but only Adobe Acrobat Reader* supports positioning the instruction reference document on the required page. To ensure correct functionality of this feature, install the latest available version of Adobe Acrobat Reader.
• Uninstalling limitation: pin.exe stays running after detaching. (200092295)
    o After attaching to a target to be profiled, the VTune™ Amplifier XE cannot be uninstalled until the target finishes running. The cause is that pin.exe keeps working after detaching from the target and exits only after the profiled application/process execution finishes.
• Second attach to the same application should print an error and exit immediately. (200092650)
    o The VTune™ Amplifier XE allows running the analysis while the previous one is in progress but does not store any data for the second analysis run.
• Specifying too low a "Sampling After Value" for some events may cause a system hang due to frequent event triggering during the collection (200093394)
    o Use a reasonable "Sampling After Value" that results in about 1000 events triggering per second. This is statistically sufficient for the data analysis. For more fine-grained analysis of sampling results, decrease the "Sampling After Value" gradually while observing how much the frequent interruptions slow down system responsiveness.
• Event-based sampling collection cannot start if the result directory path contains non-English characters (200185851)
    o When you install the product on a system with language localization, make sure the path to the result directory does not contain non-English characters.
• Truncated .NET module names may be displayed in results view (200199458)
    o When viewing results collected for a .NET application you may observe truncated .NET module names. Please make sure the system was rebooted after the .NET application was installed and before profiling with Amplifier XE.
• VTune™ Amplifier XE may crash on the analysis of OpenMP enabled binaries compiled with a certain version of Intel Compiler (200199671)
    o On Windows 7 64-bit based systems the Hotspots, Concurrency or Locks and Waits analysis may crash during the analysis of 32-bit binaries compiled with the Intel Compiler v.12.0 (also included in the Composer XE 2011 Update 1) with OpenMP enabled. Applications that use 32-bit Intel IPP or MKL libraries and are re-compiled with the 12.0 compiler may be affected as well.
• Intel® Compiler only produces first level of inlines. The nested inlines are not emitted into the debug information. (200164310)
    o Intel® Compiler currently generates debug information only for the first level of inline functions. So, you cannot see performance data attributed to functions inlined into other inline functions. Instead, this performance data is attributed to the corresponding functions inlined into regular (not inline) functions. This may also cause wrong source line attribution of performance data in the source view.
• VTune™ Amplifier XE does not resolve symbols correctly on Windows XP SP1 operating system (200216358)
    o When VTune™ Amplifier XE is run on the Windows XP Service Pack 1 operating system, symbols may not be resolved correctly and are instead shown as "[foo.dll]" names. This happens because VTune™ Amplifier XE uses a Microsoft DIA library version which requires Service Pack 2 to be installed. Please install the service pack to resolve the issue.
• Information collected via ITT API is not available when attaching to a process. (200172007)
    o When collecting statistics data using ITT API calls inserted into the source code, as in Frame Analysis or JIT-profiling, attaching to a process will not bring the expected results. Use the VTune Amplifier XE analysis to start an application instead of attaching to a process.
• Do not use the -ipo option - it causes the inline debug information to switch off (200260765)
    o If using the Intel® compiler to get performance data on inline functions, use the additional option “/debug:inline-debug-info”, but avoid using the -ipo (/Qipo on Windows) option. Currently this option disables generating the inline debug information in the compiler. Note that the Intel compiler integrated into the Microsoft Visual Studio* IDE uses /Qipo by default in the Release configuration.
• Intel® Compiler currently doesn't support function split ranges in debug info, which may lead to wrong performance data attribution when function ranges overlap (e.g. performance data is attributed to one function but should have been split between two). (200260768)
    o In some cases the Intel® Compiler generates imprecise debug information about ranges of inline functions. This may lead to wrong performance data attribution when the Inline mode is turned on, for example: performance data is attributed to just one function instead of two.

7 Attributions

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License.
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License.

Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions.

Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks.

This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty.

Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability.

While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the
following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Libxml2

Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar license but with different Copyright notices) all the files are:

Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.

Libunwind

Copyright (c) 2002 Hewlett-Packard Co.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar licence but with different Copyright notices) all the files are:

Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.

PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2

1. This LICENSE AGREEMENT is between the Python Software Foundation ("PSF"), and the Individual or Organization ("Licensee") accessing and otherwise using this software ("Python") in source or binary form and its associated documentation.

2.
Subject to the terms and conditions of this License Agreement, PSF hereby grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce, analyze, test, perform and/or display publicly, prepare derivative works, distribute, and otherwise use Python alone or in any derivative version, provided, however, that PSF's License Agreement and PSF's notice of copyright, i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008 Python Software Foundation; All Rights Reserved" are retained in Python alone or in any derivative version prepared by Licensee.

3. In the event Licensee prepares a derivative work that is based on or incorporates Python or any part thereof, and wants to make the derivative work available to others as provided herein, then Licensee hereby agrees to include in any such work a brief summary of the changes made to Python.

4. PSF is making Python available to Licensee on an "AS IS" basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT INFRINGE ANY THIRD PARTY RIGHTS.

5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON, OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.

6. This License Agreement will automatically terminate upon a material breach of its terms and conditions.

7. Nothing in this License Agreement shall be deemed to create any relationship of agency, partnership, or joint venture between PSF and Licensee. This License Agreement does not grant permission to use PSF trademarks or trade name in a trademark sense to endorse or promote products or services of Licensee, or any third party.

8.
By copying, installing or otherwise using Python, Licensee agrees to be bound by the terms and conditions of this License Agreement.

Changes to standard library modules:
====================================

A brief summary of changes made to Python 2.5.2 source:
- On Windows*, the code of import, zipimport, and execfile was modified to handle directories containing Unicode characters.

wxWidgets Library

This product includes wxWindows software which can be downloaded from www.wxwidgets.org/downloads.

wxWindows Library Licence, Version 3.1
======================================

Copyright (C) 1998-2005 Julian Smart, Robert Roebling et al

Everyone is permitted to copy and distribute verbatim copies of this licence document, but changing it is not allowed.

WXWINDOWS LIBRARY LICENCE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public Licence as published by the Free Software Foundation; either version 2 of the Licence, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public Licence for more details.

You should have received a copy of the GNU Library General Public Licence along with this software, usually in a file named COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

EXCEPTION NOTICE

1.
As a special exception, the copyright holders of this library give permission for additional uses of the text contained in this release of the library as licenced under the wxWindows Library Licence, applying either version 3.1 of the Licence, or (at your option) any later version of the Licence as published by the copyright holders of version 3.1 of the Licence document.

2. The exception is that you may use, copy, link, modify and distribute under your own terms, binary object code versions of works based on the Library.

3. If you copy code from files distributed under the terms of the GNU General Public Licence or the GNU Library General Public Licence into a copy of this library, as this licence permits, the exception does not apply to the code that you add in this way. To avoid misleading anyone as to the status of such modified files, you must delete this exception notice from such code and/or adjust the licensing conditions notice accordingly.

4. If you write modifications of your own for this library, it is your choice whether to permit this exception to apply to your modifications. If you do not wish that, you must delete the exception notice from such code and/or adjust the licensing conditions notice accordingly.

/* zlib.h -- interface of the 'zlib' general purpose compression library
version 1.2.3, July 18th, 2005

Copyright (C) 1995-2005 Jean-loup Gailly and Mark Adler

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software.
If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

Jean-loup Gailly jloup@gzip.org
Mark Adler madler@alumni.caltech.edu
*/

8 Disclaimer and Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. This document contains information on products in the design phase of development. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Core Inside, i960, Intel, the Intel logo, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others. 
Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java and all Java based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

Copyright (C) 2010-2011, Intel Corporation. All rights reserved.

Intel(R) Threading Building Blocks Reference Manual

Document Number 315415-014US.
World Wide Web: http://www.intel.com

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/#/en_US_01.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Copyright (C) 2005 - 2011, Intel Corporation. All rights reserved.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804Intel(R) Threading Building Blocks iv 315415-014US Revision History Document Number Revision Number Description Revision Date 315415- 014 1.27 Updated the Optimization Notice. 2011-Oct-27 315415- 013 1.26 Moved the flow graph from Appendix D to Section 6 and made a number of updates as it bcomes a fully supported feature. Moved concurrent_priority_queue from Appendix D to Section 5.7 as it becomes fully supported. Added serial subset, memory pools, and parallel_deterministic_reduce to Appendix D. Made other small corrections and additions. 2011-Aug-01 315415- 012 1.25 Moved task and task_group priorities from Appendix D to Section 111.3.8 and 11.6. Updated concurrent_priority_queue documentation in Section D.1 to reflect interface changes. Updated flow graph documentation in D.2 to reflect changes in the interface. Added run-time loader documentation as Section D.3. 2011-July-01 315415- 011 1.24 Fix incorrect cross-reference to Tutorial in Section 11.3.5.3. Clarify left to right properties of parallel_reduce. Add task_group_context syntax and description to parallel algorithms as needed. Add group and change_group method to task. Update description of task_group. Add task and task_group priorities to Community Preview Features as D.3. Add two examples to D.2 and describe body objects. Update overwrite_node, write_once_node and join_node. 
2011-Feb-24 315415- 010 1.23 Added graph to Community Preview Features. 2010-Dec-10 315415- 009 1.22 Added Community Preview Features Appendix. 2010-Nov-04 315415- 008 1.21 Added constructor that accepts Finit for enumerable_thread_specific. Added operator= declarations for enumerable_thread_specific. Overview Reference Manual v Contents 1 Overview .........................................................................................................1 2 General Conventions .........................................................................................2 2.1 Notation................................................................................................2 2.2 Terminology ..........................................................................................3 2.2.1 Concept ...................................................................................3 2.2.2 Model ......................................................................................4 2.2.3 CopyConstructible .....................................................................4 2.3 Identifiers .............................................................................................4 2.3.1 Case........................................................................................5 2.3.2 Reserved Identifier Prefixes ........................................................5 2.4 Namespaces ..........................................................................................5 2.4.1 tbb Namespace .........................................................................5 2.4.2 tb::flow Namespace...................................................................5 2.4.3 tbb::interfacex Namespace .........................................................5 2.4.4 tbb::internal Namespace ............................................................5 2.4.5 tbb::deprecated Namespace .......................................................6 2.4.6 tbb::strict_ppl 
Namespace..........................................................6 2.4.7 std Namespace .........................................................................6 2.5 Thread Safety ........................................................................................7 3 Environment ....................................................................................................8 3.1 Version Information................................................................................8 3.1.1 Version Macros .........................................................................8 3.1.2 TBB_VERSION Environment Variable ............................................8 3.1.3 TBB_runtime_interface_version Function ......................................9 3.2 Enabling Debugging Features ...................................................................9 3.2.1 TBB_USE_ASSERT Macro..........................................................10 3.2.2 TBB_USE_THREADING_TOOLS Macro .........................................10 3.2.3 TBB_USE_PERFORMANCE_WARNINGS Macro ..............................11 3.3 Feature macros ....................................................................................11 3.3.1 TBB_DEPRECATED macro .........................................................11 3.3.2 TBB_USE_EXCEPTIONS macro...................................................11 3.3.3 TBB_USE_CAPTURED_EXCEPTION macro....................................12 4 Algorithms .....................................................................................................13 4.1 Splittable Concept ................................................................................13 4.1.1 split Class ..............................................................................14 4.2 Range Concept.....................................................................................14 4.2.1 blocked_range Template Class ......................................16 4.2.1.1 
size_type
4.2.1.2 blocked_range( Value begin, Value end, size_t grainsize=1 )
4.2.1.3 blocked_range( blocked_range& range, split )
4.2.1.4 size_type size() const
4.2.1.5 bool empty() const
4.2.1.6 size_type grainsize() const
4.2.1.7 bool is_divisible() const
Intel(R) Threading Building Blocks vi 315415-014US
4.2.1.8 const_iterator begin() const
4.2.1.9 const_iterator end() const
4.2.2 blocked_range2d Template Class
4.2.2.1 row_range_type
4.2.2.2 col_range_type
4.2.2.3 blocked_range2d( RowValue row_begin, RowValue row_end, typename row_range_type::size_type row_grainsize, ColValue col_begin, ColValue col_end, typename col_range_type::size_type col_grainsize )
4.2.2.4 blocked_range2d( RowValue row_begin, RowValue row_end, ColValue col_begin, ColValue col_end)
4.2.2.5 blocked_range2d ( blocked_range2d& range, split )
4.2.2.6 bool empty() const
4.2.2.7 bool is_divisible() const
4.2.2.8 const row_range_type& rows() const
4.2.2.9 const col_range_type& cols() const
4.2.3 blocked_range3d Template Class
4.3 Partitioners
4.3.1 auto_partitioner Class
4.3.1.1 auto_partitioner()
4.3.1.2 ~auto_partitioner()
4.3.2 affinity_partitioner
4.3.2.1 affinity_partitioner()
4.3.2.2 ~affinity_partitioner()
4.3.3 simple_partitioner Class
4.3.3.1 simple_partitioner()
4.3.3.2 ~simple_partitioner()
4.4 parallel_for Template Function
4.5 parallel_reduce Template Function
4.6 parallel_scan Template Function
4.6.1 pre_scan_tag and final_scan_tag Classes
4.6.1.1 bool is_final_scan()
4.7 parallel_do Template Function
4.7.1 parallel_do_feeder class
4.7.1.1 void add( const Item& item )
4.8 parallel_for_each Template Function
4.9 pipeline Class
4.9.1 pipeline()
4.9.2 ~pipeline()
4.9.3
void add_filter( filter& f )
4.9.4 void run( size_t max_number_of_live_tokens[, task_group_context& group] )
4.9.5 void clear()
4.9.6 filter Class
4.9.6.1 filter( mode filter_mode )
4.9.6.2 ~filter()
4.9.6.3 bool is_serial() const
4.9.6.4 bool is_ordered() const
4.9.6.5 virtual void* operator()( void * item )
4.9.6.6 virtual void finalize( void * item )
4.9.7 thread_bound_filter Class
4.9.7.1 thread_bound_filter(mode filter_mode)
4.9.7.2 result_type try_process_item()
4.9.7.3 result_type process_item()
4.10 parallel_pipeline Function
4.10.1 filter_t Template Class
4.10.1.1 filter_t()
4.10.1.2 filter_t( const filter_t& rhs )
4.10.1.3 template filter_t( filter::mode mode, const Func& f )
4.10.1.4 void operator=( const filter_t& rhs )
4.10.1.5 ~filter_t()
4.10.1.6 void clear()
4.10.1.7 template filter_t make_filter(filter::mode mode, const Func& f)
4.10.1.8 template filter_t operator& (const filter_t& left, const filter_t& right)
4.10.2 flow_control Class
4.11 parallel_sort Template Function
4.12 parallel_invoke Template Function
5 Containers
5.1 Container Range Concept
5.2 concurrent_unordered_map Template Class
5.2.1 Construct, Destroy, Copy
5.2.1.1 explicit concurrent_unordered_map (size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
5.2.1.2 template concurrent_unordered_map (InputIterator first, InputIterator last, size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
5.2.1.3 concurrent_unordered_map(const unordered_map& m)
5.2.1.4 concurrent_unordered_map(const Alloc& a)
5.2.1.5 concurrent_unordered_map(const unordered_map&, const Alloc& a)
5.2.1.6 ~concurrent_unordered_map()
5.2.1.7 concurrent_unordered_map& operator=(const concurrent_unordered_map& m);
5.2.1.8 allocator_type get_allocator()
const;
5.2.2 Size and capacity
5.2.2.1 bool empty() const
5.2.2.2 size_type size() const
5.2.2.3 size_type max_size() const
5.2.3 Iterators
5.2.3.1 iterator begin()
5.2.3.2 const_iterator begin() const
5.2.3.3 iterator end()
5.2.3.4 const_iterator end() const
5.2.3.5 const_iterator cbegin() const
5.2.3.6 const_iterator cend() const
5.2.4 Modifiers
5.2.4.1 std::pair insert(const value_type& x)
5.2.4.2 iterator insert(const_iterator hint, const value_type& x)
5.2.4.3 template void insert(InputIterator first, InputIterator last)
5.2.4.4 iterator unsafe_erase(const_iterator position)
5.2.4.5 size_type unsafe_erase(const key_type& k)
5.2.4.6 iterator unsafe_erase(const_iterator first, const_iterator last)
5.2.4.7 void clear()
5.2.4.8 void swap(concurrent_unordered_map& m)
5.2.5 Observers
5.2.5.1 hasher hash_function() const
5.2.5.2 key_equal key_eq() const
5.2.6 Lookup
5.2.6.1 iterator find(const key_type& k)
5.2.6.2 const_iterator find(const key_type& k) const
5.2.6.3 size_type count(const key_type& k) const
5.2.6.4 std::pair equal_range(const key_type& k)
5.2.6.5 std::pair equal_range(const key_type& k) const
5.2.6.6 mapped_type& operator[](const key_type& k)
5.2.6.7 mapped_type& at( const key_type& k )
5.2.6.8 const mapped_type& at(const key_type& k) const
5.2.7 Parallel Iteration
5.2.7.1 const_range_type range() const
5.2.7.2 range_type range()
5.2.8 Bucket Interface
5.2.8.1 size_type unsafe_bucket_count() const
5.2.8.2 size_type unsafe_max_bucket_count() const
5.2.8.3 size_type unsafe_bucket_size(size_type n)
5.2.8.4 size_type unsafe_bucket(const key_type& k) const
5.2.8.5 local_iterator unsafe_begin(size_type n)
5.2.8.6 const_local_iterator unsafe_begin(size_type n) const
5.2.8.7 local_iterator unsafe_end(size_type n)
5.2.8.8 const_local_iterator unsafe_end(size_type n) const
5.2.8.9 const_local_iterator unsafe_cbegin(size_type n) const
5.2.8.10 const_local_iterator unsafe_cend(size_type n) const
5.2.9 Hash policy
5.2.9.1 float load_factor() const
5.2.9.2 float max_load_factor() const
5.2.9.3 void max_load_factor(float z)
5.2.9.4 void rehash(size_type n)
5.3 concurrent_unordered_set Template Class
5.3.1 Construct, Destroy, Copy
5.3.1.1 explicit concurrent_unordered_set (size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
5.3.1.2 template concurrent_unordered_set (InputIterator first, InputIterator last, size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
5.3.1.3 concurrent_unordered_set(const unordered_set& m)
5.3.1.4 concurrent_unordered_set(const Alloc& a)
5.3.1.5 concurrent_unordered_set(const unordered_set&, const Alloc& a)
5.3.1.6 ~concurrent_unordered_set()
5.3.1.7 concurrent_unordered_set& operator=(const concurrent_unordered_set& m);
5.3.1.8 allocator_type get_allocator() const;
5.3.2 Size and capacity
5.3.2.1 bool empty() const
5.3.2.2 size_type size() const
5.3.2.3 size_type max_size() const
5.3.3 Iterators
5.3.3.1 iterator
begin()
5.3.3.2 const_iterator begin() const
5.3.3.3 iterator end()
5.3.3.4 const_iterator end() const
5.3.3.5 const_iterator cbegin() const
5.3.3.6 const_iterator cend() const
5.3.4 Modifiers
5.3.4.1 std::pair insert(const value_type& x)
5.3.4.2 iterator insert(const_iterator hint, const value_type& x)
5.3.4.3 template void insert(InputIterator first, InputIterator last)
5.3.4.4 iterator unsafe_erase(const_iterator position)
5.3.4.5 size_type unsafe_erase(const key_type& k)
5.3.4.6 iterator unsafe_erase(const_iterator first, const_iterator last)
5.3.4.7 void clear()
5.3.4.8 void swap(concurrent_unordered_set& m)
5.3.5 Observers
5.3.5.1 hasher hash_function() const
5.3.5.2 key_equal key_eq() const
5.3.6 Lookup
5.3.6.1 iterator find(const key_type& k)
5.3.6.2 const_iterator find(const key_type& k) const
5.3.6.3 size_type count(const key_type& k) const
5.3.6.4 std::pair equal_range(const key_type& k)
5.3.6.5 std::pair equal_range(const key_type& k)
const
5.3.7 Parallel Iteration
5.3.7.1 const_range_type range() const
5.3.7.2 range_type range()
5.3.8 Bucket Interface
5.3.8.1 size_type unsafe_bucket_count() const
5.3.8.2 size_type unsafe_max_bucket_count() const
5.3.8.3 size_type unsafe_bucket_size(size_type n)
5.3.8.4 size_type unsafe_bucket(const key_type& k) const
5.3.8.5 local_iterator unsafe_begin(size_type n)
5.3.8.6 const_local_iterator unsafe_begin(size_type n) const
5.3.8.7 local_iterator unsafe_end(size_type n)
5.3.8.8 const_local_iterator unsafe_end(size_type n) const
5.3.8.9 const_local_iterator unsafe_cbegin(size_type n) const
5.3.8.10 const_local_iterator unsafe_cend(size_type n) const
5.3.9 Hash policy
5.3.9.1 float load_factor() const
5.3.9.2 float max_load_factor() const
5.3.9.3 void max_load_factor(float z)
5.3.9.4 void rehash(size_type n)
5.4 concurrent_hash_map Template Class
5.4.1 Whole Table Operations
5.4.1.1 concurrent_hash_map( const allocator_type& a = allocator_type() )
5.4.1.2 concurrent_hash_map( size_type n, const allocator_type& a = allocator_type() )
5.4.1.3 concurrent_hash_map( const concurrent_hash_map& table, const allocator_type& a = allocator_type() )
5.4.1.4 template concurrent_hash_map( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )
5.4.1.5 ~concurrent_hash_map()
5.4.1.6 concurrent_hash_map& operator= ( concurrent_hash_map& source )
5.4.1.7 void swap( concurrent_hash_map& table )
5.4.1.8 void rehash( size_type n=0 )
5.4.1.9 void clear()
5.4.1.10 allocator_type get_allocator() const
5.4.2 Concurrent Access
5.4.2.1 const_accessor
5.4.2.2 accessor
5.4.3 Concurrent Operations
5.4.3.1 size_type count( const Key& key ) const
5.4.3.2 bool find( const_accessor& result, const Key& key ) const
5.4.3.3 bool find( accessor& result, const Key& key )
5.4.3.4 bool insert( const_accessor& result, const Key& key )
5.4.3.5 bool insert( accessor& result, const Key& key )
5.4.3.6 bool insert( const_accessor& result, const value_type& value )
5.4.3.7 bool insert( accessor& result, const value_type& value )
5.4.3.8 bool insert( const value_type& value )
5.4.3.9 template void insert( InputIterator first, InputIterator last )
5.4.3.10 bool erase( const Key& key )
5.4.3.11 bool erase( const_accessor& item_accessor )
5.4.3.12 bool erase( accessor& item_accessor )
5.4.4 Parallel Iteration
5.4.4.1 const_range_type range( size_t grainsize=1 ) const
5.4.4.2 range_type range( size_t grainsize=1 )
5.4.5 Capacity
5.4.5.1 size_type size() const
5.4.5.2 bool empty() const
5.4.5.3 size_type max_size() const
5.4.5.4 size_type bucket_count() const
5.4.6 Iterators
5.4.6.1 iterator begin()
5.4.6.2 iterator end()
5.4.6.3 const_iterator begin() const
5.4.6.4 const_iterator end() const
5.4.6.5 std::pair equal_range( const Key& key );
5.4.6.6 std::pair equal_range( const Key& key ) const;
5.4.7 Global Functions
5.4.7.1 template bool operator==( const concurrent_hash_map& a, const concurrent_hash_map& b);
5.4.7.2 template bool operator!=(const concurrent_hash_map &a, const concurrent_hash_map &b);
5.4.7.3 template void swap(concurrent_hash_map &a, concurrent_hash_map &b)
5.4.8 tbb_hash_compare Class
5.5 concurrent_queue Template Class
5.5.1 concurrent_queue( const Alloc& a = Alloc () )
5.5.2 concurrent_queue( const concurrent_queue& src, const Alloc& a = Alloc() )
5.5.3 template concurrent_queue( InputIterator first, InputIterator last, const Alloc& a = Alloc() )
5.5.4 ~concurrent_queue()
5.5.5 void push( const T& source )
5.5.6 bool try_pop ( T& destination )
5.5.7 void clear()
5.5.8 size_type unsafe_size() const
5.5.9 bool empty() const
5.5.10 Alloc get_allocator() const
5.5.11 Iterators
5.5.11.1 iterator unsafe_begin()
5.5.11.2 iterator unsafe_end()
5.5.11.3 const_iterator unsafe_begin() const
5.5.11.4 const_iterator unsafe_end() const
5.6 concurrent_bounded_queue Template Class
5.6.1 void push( const T& source )
5.6.2 void pop( T& destination )
5.6.3 bool try_push( const T& source )
5.6.4 bool try_pop( T& destination )
5.6.5 size_type size() const
5.6.6 bool empty() const
5.6.7 size_type capacity() const
5.6.8 void set_capacity( size_type capacity )
5.7 concurrent_priority_queue Template Class
5.7.1 concurrent_priority_queue(const allocator_type& a = allocator_type())
5.7.2 concurrent_priority_queue(size_type init_capacity, const allocator_type& a = allocator_type())
5.7.3 concurrent_priority_queue(InputIterator begin, InputIterator end, const allocator_type& a = allocator_type())
5.7.4 concurrent_priority_queue (const concurrent_priority_queue& src, const allocator_type& a = allocator_type())
5.7.5 concurrent_priority_queue& operator=(const concurrent_priority_queue& src)
5.7.6 ~concurrent_priority_queue()
5.7.7 bool empty() const
5.7.8 size_type size() const
5.7.9 void push(const_reference elem)
5.7.10 bool try_pop(reference elem)
5.7.11 void clear()
5.7.12 void swap(concurrent_priority_queue& other)
5.7.13 allocator_type get_allocator() const
5.8 concurrent_vector
5.8.1 Construction, Copy, and Assignment
5.8.1.1 concurrent_vector( const allocator_type& a = allocator_type() )
5.8.1.2 concurrent_vector( size_type n, const_reference t=T(), const allocator_type& a = allocator_type() );
5.8.1.3 template concurrent_vector( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )
5.8.1.4 concurrent_vector( const concurrent_vector& src )
5.8.1.5 concurrent_vector& operator=( const concurrent_vector& src )
5.8.1.6 template concurrent_vector& operator=( const concurrent_vector& src )
5.8.1.7 void assign( size_type n, const_reference t )
5.8.1.8 template void assign( InputIterator first, InputIterator last )
5.8.2 Whole Vector Operations
5.8.2.1 void reserve( size_type n )
5.8.2.2 void shrink_to_fit()
5.8.2.3 void swap( concurrent_vector& x )
5.8.2.4 void clear()
5.8.2.5 ~concurrent_vector()
5.8.3 Concurrent Growth
5.8.3.1 iterator grow_by( size_type delta, const_reference t=T() )
5.8.3.2 iterator grow_to_at_least( size_type n )
5.8.3.3 iterator push_back( const_reference value )
5.8.4 Access
5.8.4.1 reference operator[]( size_type index )
5.8.4.2 const_reference operator[]( size_type index ) const
5.8.4.3 reference at( size_type index )
5.8.4.4 const_reference at( size_type index ) const
5.8.4.5 reference front()
5.8.4.6 const_reference front() const
5.8.4.7 reference back()
5.8.4.8 const_reference back() const
5.8.5 Parallel Iteration
5.8.5.1 range_type range( size_t grainsize=1 )
5.8.5.2 const_range_type range( size_t grainsize=1 ) const
5.8.6 Capacity
5.8.6.1 size_type size() const
5.8.6.2 bool empty() const
5.8.6.3 size_type capacity() const
5.8.6.4 size_type max_size() const
5.8.7 Iterators
5.8.7.1 iterator begin()
5.8.7.2 const_iterator begin() const
5.8.7.3 iterator end()
5.8.7.4 const_iterator end() const
5.8.7.5 reverse_iterator rbegin()
5.8.7.6 const_reverse_iterator rbegin() const
5.8.7.7 iterator rend()
5.8.7.8 const_reverse_iterator rend()
6 Flow Graph
6.1 graph Class
6.1.1 graph()
6.1.2 ~graph()
6.1.3 void increment_wait_count()
6.1.4 void decrement_wait_count()
6.1.5 template< typename Receiver, typename Body > void run( Receiver &r, Body body )
6.1.6 template< typename Body > void run( Body body )
6.1.7 void wait_for_all()
6.1.8 task *root_task()
6.2 sender Template Class
6.2.1 ~sender()
6.2.2 bool register_successor( successor_type & r ) = 0
6.2.3 bool remove_successor( successor_type & r ) = 0
6.2.4 bool try_get( output_type & )
6.2.5 bool try_reserve( output_type & )
6.2.6 bool try_release( )
6.2.7 bool try_consume( )
6.3 receiver Template Class
6.3.1 ~receiver()
6.3.2 bool register_predecessor( predecessor_type & p )
6.3.3 bool remove_predecessor( predecessor_type & p )
6.3.4 bool try_put( const input_type &v ) = 0
6.4 continue_msg Class
6.5 continue_receiver Class
6.5.1 continue_receiver( int num_predecessors = 0 )
6.5.2 continue_receiver( const continue_receiver& src )
6.5.3 ~continue_receiver( )
6.5.4 bool try_put( const input_type & )
6.5.5 bool register_predecessor( predecessor_type & r )
6.5.6 bool remove_predecessor( predecessor_type & r )
6.5.7 void execute() = 0
6.6 graph_node Class
6.7 continue_node Template Class
6.7.1 template< typename Body> continue_node(graph &g, Body body)
6.7.2 template< typename Body> continue_node(graph &g, int number_of_predecessors, Body body)
6.7.3 continue_node( const continue_node & src )
6.7.4 bool register_predecessor( predecessor_type & r )
6.7.5 bool remove_predecessor( predecessor_type & r )
6.7.6 bool try_put( const input_type & )
6.7.7 bool register_successor( successor_type & r )
6.7.8 bool remove_successor( successor_type & r )
6.7.9 bool try_get( output_type &v )
6.7.10 bool try_reserve( output_type & )
6.7.11 bool try_release( )
6.7.12 bool try_consume( )
6.8 function_node Template Class
6.8.1 template< typename Body> function_node(graph &g, size_t concurrency, Body body)
6.8.2 function_node( const function_node &src )
6.8.3 bool register_predecessor( predecessor_type & p )
6.8.4 bool remove_predecessor( predecessor_type & p )
6.8.5 bool try_put( const input_type &v )
6.8.6 bool register_successor( successor_type & r )
6.8.7 bool remove_successor( successor_type & r )
6.8.8 bool try_get( output_type &v )
6.8.9 bool try_reserve( output_type & )
6.8.10 bool try_release( )
6.8.11 bool try_consume( )
6.9 source_node Class
6.9.1 template< typename Body> source_node(graph &g, Body body, bool is_active=true)
6.9.2 source_node( const source_node &src )
6.9.3 bool register_successor( successor_type & r )
6.9.4 bool remove_successor( successor_type & r )
6.9.5 bool try_get( output_type &v )
6.9.6 bool try_reserve( output_type &v )
6.9.7 bool try_release( )
6.9.8 bool try_consume( )
6.10 overwrite_node Template Class
6.10.1 overwrite_node()
6.10.2 overwrite_node( const overwrite_node &src )
6.10.3 ~overwrite_node()
6.10.4 bool register_predecessor( predecessor_type & )
6.10.5 bool remove_predecessor( predecessor_type &)
6.10.6 bool try_put( const input_type &v )
6.10.7 bool register_successor( successor_type & r )
6.10.8 bool remove_successor( successor_type & r )
6.10.9 bool try_get( output_type &v )
6.10.10 bool try_reserve( output_type & )
6.10.11 bool try_release( )
6.10.12 bool try_consume( )
6.10.13 bool is_valid()
6.10.14 void clear()
6.11 write_once_node Template Class
6.11.1 write_once_node()
6.11.2 write_once_node( const write_once_node &src )
6.11.3 bool register_predecessor( predecessor_type & )
6.11.4 bool remove_predecessor( predecessor_type &)
6.11.5 bool try_put( const input_type &v )
6.11.6 bool register_successor( successor_type & r )
6.11.7 bool remove_successor( successor_type & r )
6.11.8 bool try_get( output_type &v )
6.11.9 bool try_reserve( output_type & )
6.11.10 bool try_release( )
6.11.11 bool try_consume( )
6.11.12 bool is_valid()
6.11.13 void clear()
6.12 broadcast_node Template Class
6.12.1 broadcast_node()
6.12.2 broadcast_node( const broadcast_node &src )
6.12.3 bool register_predecessor( predecessor_type & )
6.12.4 bool remove_predecessor( predecessor_type &)
6.12.5 bool try_put( const input_type &v )
6.12.6 bool register_successor( successor_type & r )
6.12.7 bool remove_successor( successor_type & r )
6.12.8 bool try_get( output_type & )
6.12.9 bool try_reserve( output_type & )
6.12.10 bool try_release( )
6.12.11 bool try_consume( )
6.13 buffer_node Class
6.13.1 buffer_node( graph& g )
6.13.2 buffer_node( const buffer_node &src )
6.13.3 bool register_predecessor( predecessor_type & )
6.13.4 bool remove_predecessor( predecessor_type &)
6.13.5 bool try_put( const input_type &v )
6.13.6 bool register_successor( successor_type & r )
6.13.7 bool remove_successor( successor_type & r )
6.13.8 bool try_get( output_type & v )
6.13.9 bool try_reserve( output_type & v )
6.13.10 bool try_release( )
6.13.11 bool try_consume( )
6.14 queue_node Template Class
6.14.1 queue_node( graph& g )
6.14.2 queue_node( const queue_node &src )
6.14.3 bool register_predecessor( predecessor_type & )
6.14.4 bool remove_predecessor( predecessor_type &)
6.14.5 bool try_put( const input_type &v )
6.14.6 bool register_successor( successor_type & r )
6.14.7 bool remove_successor( successor_type & r )
6.14.8 bool try_get( output_type & v )
6.14.9 bool try_reserve( output_type & v )
6.14.10 bool try_release( )
6.14.11 bool try_consume( )
6.15 priority_queue_node Template Class
6.15.1 priority_queue_node( graph& g)
6.15.2 priority_queue_node( const priority_queue_node &src )
6.15.3 bool register_predecessor( predecessor_type & )
6.15.4 bool remove_predecessor( predecessor_type &)
6.15.5 bool try_put( const input_type &v )
6.15.6 bool register_successor( successor_type &r )
6.15.7 bool remove_successor( successor_type &r )
6.15.8 bool try_get( output_type & v )
6.15.9 bool try_reserve( output_type & v )
6.15.10 bool try_release( )
6.15.11 bool try_consume( )
6.16 sequencer_node Template Class
6.16.1 template< typename Sequencer > sequencer_node( graph& g, const Sequencer& s )
6.16.2 sequencer_node( const sequencer_node &src )
6.16.3 bool register_predecessor( predecessor_type & )
6.16.4 bool remove_predecessor( predecessor_type &)
6.16.5 bool try_put( input_type v )
6.16.6 bool register_successor( successor_type &r )
6.16.7 bool remove_successor( successor_type &r )
6.16.8 bool try_get( output_type & v )
6.16.9 bool try_reserve( output_type & v )
6.16.10 bool try_release( )
6.16.11 bool try_consume( )
6.17 limiter_node Template Class
6.17.1 limiter_node( graph &g, size_t threshold, int number_of_decrement_predecessors )
6.17.2 limiter_node( const limiter_node &src )
6.17.3 bool register_predecessor( predecessor_type& p )
6.17.4 bool remove_predecessor( predecessor_type & r )
6.17.5 bool try_put( input_type &v )
6.17.6 bool register_successor( successor_type & r )
6.17.7 bool remove_successor( successor_type & r )
6.17.8 bool try_get( output_type & )
6.17.9 bool try_reserve( output_type & )
6.17.10 bool try_release( )
6.17.11 bool try_consume( )
6.18 join_node Template Class
6.18.1 join_node( graph &g )
6.18.2 template < typename B0, typename B1, … > join_node( graph &g, B0 b0, B1 b1, … )
6.18.3 join_node( const join_node &src )
6.18.4 input_ports_tuple_type& inputs()
6.18.5 bool register_successor( successor_type & r )
6.18.6 bool remove_successor( successor_type & r )
6.18.7 bool try_get( output_type &v )
6.18.8 bool try_reserve( T & )
6.18.9 bool try_release( )
6.18.10 bool try_consume( )
6.18.11 template typename std::tuple_element::type &input_port(JNT &jn)
6.19 input_port Template Function
6.20 make_edge Template Function
6.21 remove_edge Template Function
6.22 copy_body Template Function
7 Thread Local Storage
7.1 combinable Template Class
7.1.1 combinable()
7.1.2 template< typename FInit > combinable(FInit finit)
7.1.3 combinable( const combinable& other );
7.1.4 ~combinable()
7.1.5 combinable& operator=( const combinable& other )
7.1.6 void clear()
7.1.7 T& local()
7.1.8 T& local( bool& exists )
7.1.9 template< typename FCombine > T combine(FCombine fcombine)
7.1.10 template< typename Func > void combine_each(Func f)
7.2 enumerable_thread_specific Template Class
7.2.1 Whole Container Operations
7.2.1.1 enumerable_thread_specific()
7.2.1.2 enumerable_thread_specific(const enumerable_thread_specific &other)
7.2.1.3 template< typename U, typename Alloc, ets_key_usage_type Cachetype> enumerable_thread_specific( const enumerable_thread_specific<U, Alloc, Cachetype>& other )
7.2.1.4 template< typename Finit> enumerable_thread_specific(Finit finit)
7.2.1.5 enumerable_thread_specific(const &exemplar)
7.2.1.6 ~enumerable_thread_specific()
7.2.1.7 enumerable_thread_specific& operator=(const enumerable_thread_specific& other);
7.2.1.8 template< typename U, typename Alloc, ets_key_usage_type Cachetype> enumerable_thread_specific& operator=(const enumerable_thread_specific<U, Alloc, Cachetype>& other);
7.2.1.9 void clear()
7.2.2 Concurrent Operations
7.2.2.1 reference local()
7.2.2.2 reference local( bool& exists )
7.2.2.3 size_type size() const
7.2.2.4 bool empty() const
7.2.3 Combining
7.2.3.1 template< typename FCombine > T combine(FCombine fcombine)
7.2.3.2 template< typename Func > void combine_each(Func f)
7.2.4 Parallel Iteration
7.2.4.1 const_range_type range( size_t grainsize=1 ) const
7.2.4.2 range_type range( size_t grainsize=1 )
7.2.5 Iterators
7.2.5.1 iterator begin()
7.2.5.2 iterator end()
7.2.5.3 const_iterator begin() const
7.2.5.4 const_iterator end() const
7.3 flattened2d Template Class
7.3.1 Whole Container Operations
7.3.1.1 flattened2d( const Container& c )
7.3.1.2 flattened2d( const Container& c, typename Container::const_iterator first, typename Container::const_iterator last )
7.3.2 Concurrent Operations
7.3.2.1 size_type size() const
7.3.3 Iterators
7.3.3.1 iterator begin()
7.3.3.2 iterator end()
7.3.3.3 const_iterator begin() const
7.3.3.4 const_iterator end() const
7.3.4 Utility Functions
8 Memory Allocation
8.1 Allocator Concept
8.2 tbb_allocator Template Class
8.3 scalable_allocator Template Class
8.3.1 C Interface to Scalable Allocator
8.3.1.1 size_t scalable_msize( void* ptr )
8.4 cache_aligned_allocator Template Class
8.4.1 pointer allocate( size_type n, const void* hint=0 )
8.4.2 void deallocate( pointer p, size_type n )
8.4.3 char* _Charalloc( size_type size )
8.5 zero_allocator
8.6 aligned_space Template Class
8.6.1 aligned_space()
8.6.2 ~aligned_space()
8.6.3 T* begin()
8.6.4 T* end()
9 Synchronization
9.1 Mutexes
9.1.1 Mutex Concept
9.1.1.1 C++ 200x Compatibility
9.1.2 mutex Class
9.1.3 recursive_mutex Class
9.1.4 spin_mutex Class
9.1.5 queuing_mutex Class
9.1.6 ReaderWriterMutex Concept
9.1.6.1 ReaderWriterMutex()
9.1.6.2 ~ReaderWriterMutex()
9.1.6.3 ReaderWriterMutex::scoped_lock()
9.1.6.4 ReaderWriterMutex::scoped_lock( ReaderWriterMutex& rw, bool write=true )
9.1.6.5 ReaderWriterMutex::~scoped_lock()
9.1.6.6 void ReaderWriterMutex::scoped_lock::acquire( ReaderWriterMutex& rw, bool write=true )
9.1.6.7 bool ReaderWriterMutex::scoped_lock::try_acquire( ReaderWriterMutex& rw, bool write=true )
9.1.6.8 void ReaderWriterMutex::scoped_lock::release()
9.1.6.9 bool ReaderWriterMutex::scoped_lock::upgrade_to_writer()
9.1.6.10 bool ReaderWriterMutex::scoped_lock::downgrade_to_reader()
9.1.7 spin_rw_mutex Class
9.1.8 queuing_rw_mutex Class
9.1.9 null_mutex Class
9.1.10 null_rw_mutex Class
9.2 atomic Template Class
9.2.1 memory_semantics Enum
9.2.2 value_type fetch_and_add( value_type addend )
9.2.3 value_type fetch_and_increment()
9.2.4 value_type fetch_and_decrement()
9.2.5 value_type compare_and_swap
9.2.6 value_type fetch_and_store( value_type new_value )
9.3 PPL Compatibility
9.3.1 critical_section
9.3.2 reader_writer_lock Class
9.4 C++ 200x Synchronization
10 Timing
10.1 tick_count Class
10.1.1 static tick_count tick_count::now()
10.1.2 tick_count::interval_t operator-( const tick_count& t1, const tick_count& t0 )
10.1.3 tick_count::interval_t Class
10.1.3.1 interval_t()
10.1.3.2 interval_t( double sec )
10.1.3.3 double seconds() const
10.1.3.4 interval_t operator+=( const interval_t& i )
10.1.3.5 interval_t operator-=( const interval_t& i )
10.1.3.6 interval_t operator+ ( const interval_t& i, const interval_t& j )
10.1.3.7 interval_t operator- ( const interval_t& i, const interval_t& j )
11 Task Groups
11.1 task_group Class
11.1.1 task_group()
11.1.2 ~task_group()
11.1.3 template< typename Func > void run( const Func& f )
11.1.4 template< typename Func > void run( task_handle<Func>& handle );
11.1.5 template< typename Func > void run_and_wait( const Func& f )
11.1.6 template< typename Func > void run_and_wait( task_handle<Func>& handle );
11.1.7 task_group_status wait()
11.1.8 bool is_canceling()
11.1.9 void cancel()
11.2 task_group_status Enum
11.3 task_handle Template Class
11.4 make_task Template Function
11.5 structured_task_group Class
11.6 is_current_task_group_canceling Function
12 Task Scheduler
12.1 Scheduling Algorithm
12.2 task_scheduler_init Class
12.2.1 task_scheduler_init( int max_threads=automatic, stack_size_type thread_stack_size=0 )
12.2.2 ~task_scheduler_init()
12.2.3 void initialize( int max_threads=automatic )
12.2.4 void terminate()
12.2.5 int default_num_threads()
12.2.6 bool is_active() const
12.2.7 Mixing with OpenMP
12.3 task Class
12.3.1 task Derivation
12.3.1.1 Processing of execute()
12.3.2 task Allocation
12.3.2.1 new( task::allocate_root( task_group_context& group ) ) T
12.3.2.2 new( task::allocate_root() ) T
12.3.2.3 new( x.allocate_continuation() ) T
12.3.2.4 new( x.allocate_child() ) T
12.3.2.5 new(task::allocate_additional_child_of( y )) T
12.3.3 Explicit task Destruction
12.3.3.1 static void destroy ( task& victim )
12.3.4 Recycling Tasks
12.3.4.1 void recycle_as_continuation()
12.3.4.2 void recycle_as_safe_continuation()
12.3.4.3 void recycle_as_child_of( task& new_successor )
12.3.5 Synchronization
12.3.5.1 void set_ref_count( int count )
12.3.5.2 void increment_ref_count();
12.3.5.3 int decrement_ref_count();
12.3.5.4 void wait_for_all()
12.3.5.5 static void spawn( task& t )
12.3.5.6 static void spawn ( task_list& list )
12.3.5.7 void spawn_and_wait_for_all( task& t )
12.3.5.8 void spawn_and_wait_for_all( task_list& list )
12.3.5.9 static void spawn_root_and_wait( task& root )
12.3.5.10 static void spawn_root_and_wait( task_list& root_list )
12.3.5.11 static void enqueue ( task& )
12.3.6 task Context
12.3.6.1 static task& self()
12.3.6.2 task* parent() const
12.3.6.3 void set_parent(task* p)
12.3.6.4 bool is_stolen_task() const
12.3.6.5 task_group_context* group()
12.3.6.6 void change_group( task_group_context& ctx )
12.3.7 Cancellation
12.3.7.1 bool cancel_group_execution()
12.3.7.2 bool is_cancelled() const
12.3.8 Priorities
12.3.8.1 void enqueue ( task& t, priority_t p ) const
12.3.8.2 void set_group_priority ( priority_t )
12.3.8.3 priority_t group_priority () const
12.3.9 Affinity
12.3.9.1 affinity_id
12.3.9.2 virtual void note_affinity ( affinity_id id )
12.3.9.3 void set_affinity( affinity_id id )
12.3.9.4 affinity_id affinity() const
12.3.10 task Debugging
12.3.10.1 state_type state() const
12.3.10.2 int ref_count() const
12.4 empty_task Class
12.5 task_list Class
12.5.1 task_list()
12.5.2 ~task_list()
12.5.3 bool empty() const
12.5.4 push_back( task& task )
12.5.5 task& task pop_front()
12.5.6 void clear()
12.6 task_group_context
12.6.1 task_group_context( kind_t relation_to_parent=bound, uintptr_t traits=default_traits )
12.6.2 ~task_group_context()
12.6.3 bool cancel_group_execution()
12.6.4 bool is_group_execution_cancelled() const
12.6.5 void reset()
12.6.6 void set_priority ( priority_t )
12.6.7 priority_t priority () const
12.7 task_scheduler_observer
12.7.1 task_scheduler_observer()
304 12.7.2 ~task_scheduler_observer() ................................................... 304 12.7.3 void observe( bool state=true ) ............................................... 304 12.7.4 bool is_observing() const........................................................ 304 12.7.5 virtual void on_scheduler_entry( bool is_worker) ....................... 304 12.7.6 virtual void on_scheduler_exit( bool is_worker ) ........................ 305 12.8 Catalog of Recommended task Patterns ................................................. 305 12.8.1 Blocking Style With k Children................................................. 306 12.8.2 Continuation-Passing Style With k Children ............................... 306 12.8.2.1 Recycling Parent as Continuation .............................. 307 12.8.2.2 Recycling Parent as a Child ...................................... 307 12.8.3 Letting Main Thread Work While Child Tasks Run ....................... 308 13 Exceptions ................................................................................................... 310 13.1 tbb_exception.................................................................................... 310 13.2 captured_exception ............................................................................ 311 13.2.1 captured_exception( const char* name, const char* info ) .......... 312 13.3 movable_exception .................................................... 312 13.3.1 movable_exception( const ExceptionData& src ) ........................ 313 13.3.2 ExceptionData& data() throw()................................................ 313 13.3.3 const ExceptionData& data() const throw() ............................... 314 13.4 Specific Exceptions ............................................................................. 314 14 Threads ....................................................................................................... 
316 14.1 thread Class ...................................................................................... 317 14.1.1 thread() ............................................................................... 318 14.1.2 template thread(F f).......................................... 318 14.1.3 template thread(F f, X x)................. 318 14.1.4 template thread(F f, X x, Y y) ....................................................................................... 318 14.1.5 thread& operator=(thread& x) ................................................ 318 14.1.6 ~thread ............................................................................... 319 14.1.7 bool joinable() const .............................................................. 319 14.1.8 void join() ............................................................................ 319 14.1.9 void detach() ........................................................................ 319 14.1.10 id get_id() const.................................................................... 319 14.1.11 native_handle_type native_handle() ........................................ 320 14.1.12 static unsigned hardware_concurrency()................................... 320 14.2 thread::id ......................................................................................... 320 14.3 this_thread Namespace ....................................................................... 321 14.3.1 thread::id get_id() ................................................................ 321 14.3.2 void yield()........................................................................... 321 14.3.3 void sleep_for( const tick_count::interval_t & i)......................... 321Intel(R) Threading Building Blocks xxii 315415-014US 15 References................................................................................................... 
323 Appendix A Compatibility Features ................................................................................... 324 A.1 parallel_while Template Class............................................................... 324 A.1.1 parallel_while().......................................................... 325 A.1.2 ~parallel_while() ....................................................... 326 A.1.3 Template void run( Stream& stream, const Body& body )........................................................................ 326 A.1.4 void add( const value_type& item ).......................................... 326 A.2 Interface for constructing a pipeline filter............................................... 326 A.2.1 filter::filter( bool is_serial )..................................................... 326 A.2.2 filter::serial .......................................................................... 327 A.3 Debugging Macros .............................................................................. 327 A.4 tbb::deprecated::concurrent_queue Template Class .................. 327 A.5 Interface for concurrent_vector ............................................................ 329 A.5.1 void compact()...................................................................... 330 A.6 Interface for class task........................................................................ 330 A.6.1 void recycle _to_reexecute()................................................... 330 A.6.2 Depth interface for class task .................................................. 331 A.7 tbb_thread Class ................................................................................ 331 Appendix B PPL Compatibility .......................................................................................... 332 Appendix C Known Issues ............................................................................................... 
  C.1 Windows* OS
Appendix D Community Preview Features
  D.1 Flow Graph
    D.1.1 or_node Template Class
    D.1.2 multioutput_function_node Template Class
    D.1.3 split_node Template Class
  D.2 Run-time loader
    D.2.1 runtime_loader Class
  D.3 parallel_deterministic_reduce Template Function
  D.4 Scalable Memory Pools
    D.4.1 memory_pool Template Class
    D.4.2 fixed_pool Class
    D.4.3 memory_pool_allocator Template Class
  D.5 Serial subset
    D.5.1 tbb::serial::parallel_for()

1 Overview

Intel® Threading Building Blocks (Intel® TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. It is designed to promote scalable data parallel programming. Additionally, it fully supports nested parallelism, so you can build larger parallel components from smaller parallel components. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner.
Many of the library interfaces employ generic programming, in which interfaces are defined by requirements on types and not specific types. The C++ Standard Template Library (STL) is an example of generic programming. Generic programming enables Intel® Threading Building Blocks to be flexible yet efficient. The generic interfaces enable you to customize components to your specific needs. The net result is that Intel® Threading Building Blocks enables you to specify parallelism far more conveniently than using raw threads, and at the same time can improve performance.

This document is a reference manual. It is organized for looking up details about syntax and semantics. You should first read the Intel® Threading Building Blocks Getting Started Guide and the Intel® Threading Building Blocks Tutorial to learn how to use the library effectively. The Intel® Threading Building Blocks Design Patterns document is another useful resource.

TIP: Even experienced parallel programmers should read the Intel® Threading Building Blocks Tutorial before using this reference guide because Intel® Threading Building Blocks uses a surprising recursive model of parallelism and generic algorithms.

2 General Conventions

This section describes conventions used in this document.

2.1 Notation

Literal program text appears in Courier font. Algebraic placeholders are in monospace italics. For example, the notation blocked_range<Type> indicates that blocked_range is literal, but Type is a notational placeholder. Real program text replaces Type with a real type, such as in blocked_range<int>.

Class members are summarized by informal class declarations that describe the class as it seems to clients, not how it is actually implemented.
For example, here is an informal declaration of class Foo:

    class Foo {
    public:
        int x();
        int y;
        ~Foo();
    };

The actual implementation might look like:

    namespace internal {
        class FooBase {
        protected:
            int x();
        };

        class Foo_v3: protected FooBase {
        private:
            int internal_stuff;
        public:
            using FooBase::x;
            int y;
        };
    }

    typedef internal::Foo_v3 Foo;

The example shows three cases where the actual implementation departs from the informal declaration:
• Foo is actually a typedef to Foo_v3.
• Method x() is inherited from a protected base class.
• The destructor is an implicit method generated by the compiler.

The informal declarations are intended to show you what you need to know to use the class without the distraction of irrelevant clutter particular to the implementation.

2.2 Terminology

This section describes terminology specific to Intel® Threading Building Blocks (Intel® TBB).

2.2.1 Concept

A concept is a set of requirements on a type. The requirements may be syntactic or semantic. For example, the concept of “sortable” could be defined as a set of requirements that enable an array to be sorted. A type T would be sortable if:
• x < y returns a boolean value, and represents a total order on items of type T.
• swap(x,y) swaps items x and y.

You can write a sorting template function in C++ that sorts an array of any type that is sortable.

Two approaches for defining concepts are valid expressions and pseudo-signatures¹. The ISO C++ standard follows the valid expressions approach, which shows what the usage pattern looks like for a concept. It has the drawback of relegating important details to notational conventions. This document uses pseudo-signatures, because they are concise, and can be cut-and-pasted for an initial implementation.
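As a minimal sketch of the “sortable” idea described above, the following template function sorts an array using only the two operations the concept names, x < y and swap(x,y). The names insertion_sort and Version are hypothetical illustrations and are not part of Intel® TBB.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical user type that models "sortable": it supplies operator<
// and can be swapped through std::swap.
struct Version {
    int major_no;
    int minor_no;
};

bool operator<( const Version& x, const Version& y ) {
    if( x.major_no != y.major_no ) return x.major_no < y.major_no;
    return x.minor_no < y.minor_no;
}

// Illustrative sort that touches elements only through x < y and swap(x,y).
template<typename T>
void insertion_sort( T* a, std::size_t n ) {
    using std::swap;  // lets a user-provided swap be found by argument-dependent lookup
    for( std::size_t i = 1; i < n; ++i )
        for( std::size_t j = i; j > 0 && a[j] < a[j-1]; --j )
            swap( a[j], a[j-1] );
}
```

Because insertion_sort relies on nothing beyond the two requirements, any type that models sortable can be passed to it, including built-in types such as int.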
For example, Table 1 shows pseudo-signatures for a sortable type T:

Table 1: Pseudo-Signatures for Example Concept “sortable”

Pseudo-Signature                          Semantics
bool operator<(const T& x, const T& y)    Compare x and y.
void swap(T& x, T& y)                     Swap x and y.

¹ See Section 3.2.3 of Concepts for C++0x available at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1758.pdf for further discussion of valid expressions versus pseudo-signatures.

A real signature may differ from the pseudo-signature that it implements in ways where implicit conversions would deal with the difference. For an example type U, the real signature that implements operator< in Table 1 can be expressed as int operator<( U x, U y ), because C++ permits implicit conversion from int to bool, and implicit conversion from U to (const U&). Similarly, the real signature bool operator<( U& x, U& y ) is acceptable because C++ permits implicit addition of a const qualifier to a reference type.

2.2.2 Model

A type models a concept if it meets the requirements of the concept. For example, type int models the sortable concept in Table 1 if there exists a function swap(x,y) that swaps two int values x and y. The other requirement for sortable, specifically x<y [...] as tbb::version4::concurrent_hashmap and employs a using directive to inject it into namespace tbb. Your source code should reference it as tbb::concurrent_hashmap.

2.4.4 tbb::internal Namespace

Namespace tbb::internal serves a role similar to tbb::interfacex. It is retained for backwards compatibility with older versions of the library. Your code should never directly reference namespace tbb::internal. Indirect reference via a public typedef provided by the header files is permitted.

2.4.5 tbb::deprecated Namespace

The library uses the namespace tbb::deprecated for deprecated identifiers that have different default meanings in namespace tbb.
Compiling with TBB_DEPRECATED=1 causes such identifiers to replace their counterpart in namespace tbb.

For example, tbb::concurrent_queue underwent changes in Intel® TBB 2.2 that split its functionality into tbb::concurrent_queue and tbb::concurrent_bounded_queue and changed the name of some methods. For sake of legacy code, the old Intel® TBB 2.1 functionality is retained in tbb::deprecated::concurrent_queue, which is injected into namespace tbb when compiled with TBB_DEPRECATED=1.

2.4.6 tbb::strict_ppl Namespace

The library uses the namespace tbb::strict_ppl for identifiers that are put in namespace Concurrency when tbb/compat/ppl.h is included.

2.4.7 std Namespace

The library implements some C++0x features in namespace std. The library version can be used by including the corresponding header in Table 3.

Table 3: C++0x Features Optionally Defined by Intel® Threading Building Blocks

Header                           Identifiers Added to std::    Section
tbb/compat/condition_variable    defer_lock_t                  9.4
                                 try_to_lock_t
                                 adopt_lock_t
                                 defer_lock
                                 try_to_lock
                                 adopt_lock
                                 lock_guard
                                 unique_lock
                                 swap²
                                 condition_variable
                                 cv_status
                                 timeout
                                 no_timeout
tbb/compat/thread                thread                        14.1
                                 this_thread

² Adds swap of two unique_lock objects, not the general swap template function.

To prevent accidental linkage with other implementations of these C++ library features, the library defines the identifiers in other namespaces and injects them into namespace std::. This way the “mangled name” seen by the linker will differ from the “mangled name” generated by other implementations.

2.5 Thread Safety

Unless otherwise stated, the thread safety rules for the library are as follows:
• Two threads can invoke a method or function concurrently on different objects, but not the same object.
• It is unsafe for two threads to invoke concurrently methods or functions on the same object.

Descriptions of the classes note departures from this convention.
For example, the concurrent containers are more liberal. By their nature, they do permit some concurrent operations on the same container object.

3 Environment

This section describes features of Intel® Threading Building Blocks (Intel® TBB) that relate to general environment issues.

3.1 Version Information

Intel® TBB has macros, an environment variable, and a function that reveal version and run-time information.

3.1.1 Version Macros

The header tbb/tbb_stddef.h defines macros related to versioning, as described in Table 4. You should not redefine these macros.

Table 4: Version Macros

Macro                               Description of Value
TBB_INTERFACE_VERSION               Current interface version. The value is a decimal numeral of the form xyyy where x is the major version number and y is the minor version number.
TBB_INTERFACE_VERSION_MAJOR         TBB_INTERFACE_VERSION/1000; that is, the major version number.
TBB_COMPATIBLE_INTERFACE_VERSION    Oldest major interface version still supported.

3.1.2 TBB_VERSION Environment Variable

Set the environment variable TBB_VERSION to 1 to cause the library to print information on stderr. Each line is of the form “TBB: tag value”, where tag and value are described in Table 5.

Table 5: Output from TBB_VERSION

Tag                  Description of Value
VERSION              Intel® TBB product version number.
INTERFACE_VERSION    Value of macro TBB_INTERFACE_VERSION when library was compiled.
BUILD_...            Various information about the machine configuration on which the library was built.
TBB_USE_ASSERT       Setting of macro TBB_USE_ASSERT.
DO_ITT_NOTIFY        1 if library can enable instrumentation for Intel® Parallel Studio and Intel® Threading Tools; 0 or undefined otherwise.
ITT                  yes if library has enabled instrumentation for Intel® Parallel Studio and Intel® Threading Tools, no otherwise. Typically yes only if the program is running under control of Intel® Parallel Studio or Intel® Threading Tools.
ALLOCATOR            Underlying allocator for tbb::tbb_allocator. It is scalable_malloc if the Intel® TBB malloc library was successfully loaded; malloc otherwise.

CAUTION: This output is implementation specific and may change at any time.

3.1.3 TBB_runtime_interface_version Function

Summary
Function that returns the interface version of the Intel® TBB library that was loaded at runtime.

Syntax
    extern "C" int TBB_runtime_interface_version();

Header
    #include "tbb/tbb_stddef.h"

Description
The value returned by TBB_runtime_interface_version() may differ from the value of TBB_INTERFACE_VERSION obtained at compile time. This can be used to identify whether an application was compiled against a compatible version of the Intel® TBB headers.

In general, the run-time value TBB_runtime_interface_version() must be greater than or equal to the compile-time value of TBB_INTERFACE_VERSION. Otherwise the application may fail to resolve all symbols at run time.

3.2 Enabling Debugging Features

Four macros control certain debugging features. In general, it is useful to compile with these features on for development code, and off for production code, because the features may decrease performance. Table 6 summarizes the macros and their default values. A value of 1 enables the corresponding feature; a value of 0 disables the feature.

Table 6: Debugging Macros

Macro                           Default Value                                      Feature
TBB_USE_DEBUG                   Windows* OS: 1 if _DEBUG is defined, 0 otherwise.  Default value for all other macros in this table.
                                All other systems: 0.
TBB_USE_ASSERT                  TBB_USE_DEBUG                                      Enable internal assertion checking. Can significantly slow performance.
TBB_USE_THREADING_TOOLS         TBB_USE_DEBUG                                      Enable full support for Intel® Parallel Studio and Intel® Threading Tools.
TBB_USE_PERFORMANCE_WARNINGS    TBB_USE_DEBUG                                      Enable warnings about performance issues.

3.2.1 TBB_USE_ASSERT Macro

The macro TBB_USE_ASSERT controls whether error checking is enabled in the header files. Define TBB_USE_ASSERT as 1 to enable error checking.
If an error is detected, the library prints an error message on stderr and calls the standard C routine abort. To stop a program when internal error checking detects a failure, place a breakpoint on tbb::assertion_failure.

TIP: On Microsoft Windows* operating systems, debug builds implicitly set TBB_USE_ASSERT to 1 by default.

3.2.2 TBB_USE_THREADING_TOOLS Macro

The macro TBB_USE_THREADING_TOOLS controls support for Intel® Threading Tools:
• Intel® Parallel Inspector
• Intel® Parallel Amplifier
• Intel® Thread Profiler
• Intel® Thread Checker

Define TBB_USE_THREADING_TOOLS as 1 to enable full support for these tools. That is, full support is enabled if error checking is enabled. Leave TBB_USE_THREADING_TOOLS undefined or zero to enable top performance in release builds, at the expense of turning off some support for tools.

3.2.3 TBB_USE_PERFORMANCE_WARNINGS Macro

The macro TBB_USE_PERFORMANCE_WARNINGS controls performance warnings. Define it to be 1 to enable the warnings. Currently, the warnings affected are:
• Some that report poor hash functions for concurrent_hash_map. Enabling the warnings may impact performance.
• Misaligned 8-byte atomic stores on Intel® IA-32 processors.

3.3 Feature macros

Macros in this section control optional features in the library.

3.3.1 TBB_DEPRECATED macro

The macro TBB_DEPRECATED controls deprecated features that would otherwise conflict with non-deprecated use. Define it to be 1 to get deprecated Intel® TBB 2.1 interfaces. Appendix A describes deprecated features.

3.3.2 TBB_USE_EXCEPTIONS macro

The macro TBB_USE_EXCEPTIONS controls whether the library headers use exception-handling constructs such as try, catch, and throw. The headers do not use these constructs when TBB_USE_EXCEPTIONS=0.

For the Microsoft Windows*, Linux*, and MacOS* operating systems, the default value is 1 if exception handling constructs are enabled in the compiler, and 0 otherwise.
CAUTION: The runtime library may still throw an exception when TBB_USE_EXCEPTIONS=0.

3.3.3 TBB_USE_CAPTURED_EXCEPTION macro

The macro TBB_USE_CAPTURED_EXCEPTION controls rethrow of exceptions within the library. Because C++ 1998 does not support catching an exception on one thread and rethrowing it on another thread, the library sometimes resorts to rethrowing an approximation called tbb::captured_exception.
• Define TBB_USE_CAPTURED_EXCEPTION=1 to make the library rethrow an approximation. This is useful for uniform behavior across platforms.
• Define TBB_USE_CAPTURED_EXCEPTION=0 to request rethrow of the exact exception. This setting is valid only on platforms that support the std::exception_ptr feature of C++ 200x. Otherwise a compile-time diagnostic is issued.

The default value is 0 for supported host compilers with std::exception_ptr, and 1 otherwise. Section 13 describes exception handling and TBB_USE_CAPTURED_EXCEPTION in more detail.

4 Algorithms

Most parallel algorithms provided by Intel® Threading Building Blocks (Intel® TBB) are generic. They operate on all types that model the necessary concepts. Parallel algorithms may be nested. For example, the body of a parallel_for can invoke another parallel_for.

CAUTION: When the body of an outer parallel algorithm invokes another parallel algorithm, it may cause the outer body to be re-entered for a different iteration of the outer algorithm. For example, if the outer body holds a global lock while calling an inner parallel algorithm, the body will deadlock if the re-entrant invocation attempts to acquire the same global lock. This ill-formed example is a special case of a general rule that code should not hold a lock while calling code written by another author.

4.1 Splittable Concept

Summary
Requirements for a type whose instances can be split into two pieces.

Requirements
Table 7 lists the requirements for a splittable type X with instance x.
Table 7: Splittable Concept

Pseudo-Signature       Semantics
X::X(X& x, split)      Split x into x and newly constructed object.

Description
A type is splittable if it has a splitting constructor that allows an instance to be split into two pieces. The splitting constructor takes as arguments a reference to the original object, and a dummy argument of type split, which is defined by the library. The dummy argument distinguishes the splitting constructor from a copy constructor. After the constructor runs, x and the newly constructed object should represent the two pieces of the original x. The library uses splitting constructors in two contexts:
• Partitioning a range into two subranges that can be processed concurrently.
• Forking a body (function object) into two bodies that can run concurrently.

The following model types provide examples.

Model Types
blocked_range (4.2.1) and blocked_range2d (4.2.2) represent splittable ranges. For each of these, splitting partitions the range into two subranges. See the example in Section 4.2.1.3 for the splitting constructor of blocked_range.

The bodies for parallel_reduce (4.5) and parallel_scan (4.6) must be splittable. For each of these, splitting results in two bodies that can be run concurrently.

4.1.1 split Class

Summary
Type for dummy argument of a splitting constructor.

Syntax
    class split;

Header
    #include "tbb/tbb_stddef.h"

Description
An argument of type split is used to distinguish a splitting constructor from a copy constructor.

Members
    namespace tbb {
        class split {
        };
    }

4.2 Range Concept

Summary
Requirements for type representing a recursively divisible set of values.

Requirements
Table 8 lists the requirements for a Range type R.

Table 8: Range Concept

Pseudo-Signature                Semantics
R::R( const R& )                Copy constructor.
R::~R()                         Destructor.
bool R::empty() const           True if range is empty.
bool R::is_divisible() const    True if range can be partitioned into two subranges.
R::R( R& r, split )             Split r into two subranges.

Description
A Range can be recursively subdivided into two parts. It is recommended that the division be into nearly equal parts, but it is not required. Splitting as evenly as possible typically yields the best parallelism. Ideally, a range is recursively splittable until the parts represent portions of work that are more efficient to execute serially rather than split further. The amount of work represented by a Range typically depends upon higher level context, hence a typical type that models a Range should provide a way to control the degree of splitting. For example, the template class blocked_range (4.2.1) has a grainsize parameter that specifies the biggest range considered indivisible.

The constructor that implements splitting is called a splitting constructor. If the set of values has a sense of direction, then by convention the splitting constructor should construct the second part of the range, and update the argument to be the first half. Following this convention causes the parallel_for (4.4), parallel_reduce (4.5), and parallel_scan (4.6) algorithms, when running sequentially, to work across a range in the increasing order typical of an ordinary sequential loop.

Example
The following code defines a type TrivialIntegerRange that models the Range concept. It represents a half-open interval [lower,upper) that is divisible down to a single integer.

    struct TrivialIntegerRange {
        int lower;
        int upper;
        bool empty() const {return lower==upper;}
        bool is_divisible() const {return upper>lower+1;}
        TrivialIntegerRange( TrivialIntegerRange& r, split ) {
            int m = (r.lower+r.upper)/2;
            lower = m;
            upper = r.upper;
            r.upper = m;
        }
    };

TrivialIntegerRange is for demonstration and not very practical, because it lacks a grainsize parameter. Use the library class blocked_range instead.
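The splitting convention described above can be checked with a small self-contained sketch. It is not library code: it declares a stand-in split tag type instead of using tbb::split so that it compiles without the library, adds an ordinary constructor to TrivialIntegerRange (an assumption made only so that a range can be created here), and uses a hypothetical serial driver named visit_in_order.

```cpp
#include <cassert>
#include <vector>

// Stand-in for tbb::split so the sketch compiles without the library.
struct split {};

struct TrivialIntegerRange {
    int lower;
    int upper;
    // Ordinary constructor, added for this sketch only.
    TrivialIntegerRange( int lo, int hi ) : lower(lo), upper(hi) {}
    bool empty() const { return lower==upper; }
    bool is_divisible() const { return upper>lower+1; }
    // Splitting constructor: the new object takes the second half,
    // and r is updated to be the first half.
    TrivialIntegerRange( TrivialIntegerRange& r, split ) {
        int m = (r.lower+r.upper)/2;
        lower = m;
        upper = r.upper;
        r.upper = m;
    }
};

// Hypothetical serial driver: split until indivisible, visit the first
// half before the second, and record each value that is reached.
void visit_in_order( TrivialIntegerRange r, std::vector<int>& out ) {
    if( r.is_divisible() ) {
        TrivialIntegerRange second( r, split() );
        visit_in_order( r, out );
        visit_in_order( second, out );
    } else if( !r.empty() ) {
        out.push_back( r.lower );
    }
}
```

Because each split leaves the first half in the argument and places the second half in the new object, visiting the argument first reaches the values in increasing order, as an ordinary sequential loop would.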
Model Types
Type blocked_range (4.2.1) models a one-dimensional range.
Type blocked_range2d (4.2.2) models a two-dimensional range.
Type blocked_range3d (4.2.3) models a three-dimensional range.
Concept Container Range (5.1) models a container as a range.

4.2.1 blocked_range<Value> Template Class

Summary
Template class for a recursively divisible half-open interval.

Syntax
    template<typename Value> class blocked_range;

Header
    #include "tbb/blocked_range.h"

Description
A blocked_range<Value> represents a half-open range [i,j) that can be recursively split. The types of i and j must model the requirements in Table 9. In the table, type D is the type of the expression “j-i”. It can be any integral type that is convertible to size_t. Examples that model the Value requirements are integral types, pointers, and STL random-access iterators whose difference can be implicitly converted to a size_t.

A blocked_range models the Range concept (4.2).

Table 9: Value Concept for blocked_range

Pseudo-Signature                                     Semantics
Value::Value( const Value& )                         Copy constructor.
Value::~Value()                                      Destructor.
void³ operator=( const Value& )                      Assignment.
bool operator<( const Value& i, const Value& j )     Value i precedes value j.
D operator-( const Value& i, const Value& j )        Number of values in range [i,j).
Value operator+( const Value& i, D k )               kth value after i.

A blocked_range<Value> specifies a grainsize of type size_t. A blocked_range is splittable into two subranges if the size of the range exceeds grain size. The ideal grain size depends upon the context of the blocked_range<Value>, which is typically as the range argument to the loop templates parallel_for, parallel_reduce, or parallel_scan. Too small a grainsize may cause scheduling overhead within the loop templates to swamp speedup gained from parallelism. Too large a grainsize may unnecessarily limit parallelism.
For example, if the grain size is so large that the range can be split only once, then the maximum possible parallelism is two.

Here is a suggested procedure for choosing grainsize:
1. Set the grainsize parameter to 10,000. This value is high enough to amortize scheduler overhead sufficiently for practically all loop bodies, but may unnecessarily limit parallelism.
2. Run your algorithm on one processor.
3. Start halving the grainsize parameter and see how much the algorithm slows down as the value decreases. A slowdown of about 5-10% is a good setting for most purposes.

TIP: For a blocked_range [i,j) where j [...] typically appears as a range argument to a loop template. See the examples for parallel_for (4.4), parallel_reduce (4.5), and parallel_scan (4.6).

³ The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored by blocked_range.

Members
    namespace tbb {
        template<typename Value>
        class blocked_range {
        public:
            // types
            typedef size_t size_type;
            typedef Value const_iterator;

            // constructors
            blocked_range( Value begin, Value end, size_type grainsize=1 );
            blocked_range( blocked_range& r, split );

            // capacity
            size_type size() const;
            bool empty() const;

            // access
            size_type grainsize() const;
            bool is_divisible() const;

            // iterators
            const_iterator begin() const;
            const_iterator end() const;
        };
    }

4.2.1.1 size_type

Description
The type for measuring the size of a blocked_range. The type is always a size_t.

const_iterator

Description
The type of a value in the range. Despite its name, the type const_iterator is not necessarily an STL iterator; it merely needs to meet the Value requirements in Table 9. However, it is convenient to call it const_iterator so that if it is a const_iterator, then the blocked_range behaves like a read-only STL container.
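To make the grainsize discussion above concrete, the sketch below counts how many indivisible pieces result from recursively halving a range of n values. It is not the library's implementation: count_pieces is a hypothetical helper, and it assumes that a piece is split only while its size exceeds the grainsize and that each split is made at the midpoint.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of indivisible pieces produced when a range
// of n values is halved recursively while its size exceeds grainsize.
// Assumes midpoint splitting; grainsize must be positive.
std::size_t count_pieces( std::size_t n, std::size_t grainsize ) {
    if( n <= grainsize ) return 1;      // not divisible: a single piece
    std::size_t first = n/2;            // size of the first half
    std::size_t second = n - first;     // size of the second half
    return count_pieces( first, grainsize ) + count_pieces( second, grainsize );
}
```

Under these assumptions, a range of 100 values with a grainsize of 60 can be split only once, giving two pieces, whereas a grainsize of 10 yields 16 pieces that could be processed concurrently.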
4.2.1.2 blocked_range( Value begin, Value end, size_t grainsize=1 )

Requirements
The parameter grainsize must be positive. The debug version of the library raises an assertion failure if this requirement is not met.

Effects
Constructs a blocked_range representing the half-open interval [begin,end) with the given grainsize.

Example
The statement “blocked_range<int> r( 5, 14, 2 );” constructs a range of int that contains the values 5 through 13 inclusive, with a grainsize of 2. Afterwards, r.begin()==5 and r.end()==14.

4.2.1.3 blocked_range( blocked_range& range, split )

Requirements
is_divisible() is true.

Effects
Partitions range into two subranges. The newly constructed blocked_range is approximately the second half of the original range, and range is updated to be the remainder. Each subrange has the same grainsize as the original range.

Example
Let i and j be integers that define a half-open interval [i,j) and let g specify a grain size. The statement blocked_range<int> r(i,j,g) constructs a blocked_range<int> that represents [i,j) with grain size g. Running the statement blocked_range<int> s(r,split()); subsequently causes r to represent [i, i+(j-i)/2) and s to represent [i+(j-i)/2, j), both with grain size g.

4.2.1.4 size_type size() const

Requirements
end() [...] grainsize(); false otherwise.

4.2.1.8 const_iterator begin() const

Returns
Inclusive lower bound on range.

4.2.1.9 const_iterator end() const

Returns
Exclusive upper bound on range.

4.2.2 blocked_range2d Template Class

Summary
Template class that represents recursively divisible two-dimensional half-open interval.

Syntax
    template<typename RowValue, typename ColValue=RowValue> class blocked_range2d;

Header
    #include "tbb/blocked_range2d.h"

Description
A blocked_range2d represents a half-open two dimensional range [i0,j0)×[i1,j1). Each axis of the range has its own splitting threshold. The RowValue and ColValue must meet the requirements in Table 9.
A 463H915H blocked_range is splittable if either axis is splittable. A blocked_range models the Range concept (4.2). 464H916H Members namespace tbb { template class blocked_range2d { public: // Types typedef blocked_range row_range_type; typedef blocked_range col_range_type; // Constructors blocked_range2d( RowValue row_begin, RowValue row_end, typename row_range_type::size_type row_grainsize, ColValue col_begin, ColValue col_end, typename col_range_type::size_type col_grainsize); blocked_range2d( RowValue row_begin, RowValue row_end, ColValue col_begin, ColValue col_end); blocked_range2d( blocked_range2d& r, split ); // Capacity bool empty() const; // Access bool is_divisible() const; const row_range_type& rows() const; const col_range_type& cols() const; 22 315415-014US }; } Example The code that follows shows a serial matrix multiply, and the corresponding parallel matrix multiply that uses a blocked_range2d to specify the iteration space. const size_t L = 150; const size_t M = 225; const size_t N = 300; void SerialMatrixMultiply( float c[M][N], float a[M][L], float b[L][N] ) { for( size_t i=0; i& r ) const { float (*a)[L] = my_a; float (*b)[N] = my_b; float (*c)[N] = my_c; for( size_t i=r.rows().begin(); i!=r.rows().end(); ++i ){ for( size_t j=r.cols().begin(); j!=r.cols().end(); ++j ) { Algorithms Reference Manual 23 float sum = 0; for( size_t k=0; k(0, M, 16, 0, N, 32), MatrixMultiplyBody2D(c,a,b) ); } The blocked_range2d enables the two outermost loops of the serial version to become parallel loops. The parallel_for recursively splits the blocked_range2d until the pieces are no larger than 16×32. It invokes MatrixMultiplyBody2D::operator() on each piece. 4.2.2.1 row_range_type Description A blocked_range. That is, the type of the row values. 4.2.2.2 col_range_type Description A blocked_range. 
That is, the type of the column values.

4.2.2.3 blocked_range2d( RowValue row_begin, RowValue row_end, typename row_range_type::size_type row_grainsize, ColValue col_begin, ColValue col_end, typename col_range_type::size_type col_grainsize )

Effects
Constructs a blocked_range2d representing a two-dimensional space of values. The space is the half-open Cartesian product [row_begin,row_end)×[col_begin,col_end), with the given grain sizes for the rows and columns.

Example
The statement "blocked_range2d<char,int> r('a', 'z'+1, 3, 0, 10, 2 );" constructs a two-dimensional space that contains all value pairs of the form (i, j), where i ranges from 'a' to 'z' with a grain size of 3, and j ranges from 0 to 9 with a grain size of 2.

4.2.2.4 blocked_range2d( RowValue row_begin, RowValue row_end, ColValue col_begin, ColValue col_end )

Effects
Same as blocked_range2d(row_begin,row_end,1,col_begin,col_end,1).

4.2.2.5 blocked_range2d( blocked_range2d& range, split )

Effects
Partitions range into two subranges. The newly constructed blocked_range2d is approximately the second half of the original range, and range is updated to be the remainder. Each subrange has the same grain size as the original range. The split is either by rows or by columns. The axis to split is chosen so that, after repeated splitting, the subranges approach the aspect ratio of the respective row and column grain sizes. For example, if row_grainsize is twice col_grainsize, the subranges tend towards having twice as many rows as columns.

4.2.2.6 bool empty() const

Effects
Determines if range is empty.

Returns
rows().empty()||cols().empty()

4.2.2.7 bool is_divisible() const

Effects
Determines if range can be split into subranges.

Returns
rows().is_divisible()||cols().is_divisible()

4.2.2.8 const row_range_type& rows() const

Returns
Range containing the rows of the value space.
4.2.2.9 const col_range_type& cols() const

Returns
Range containing the columns of the value space.

4.2.3 blocked_range3d Template Class

Summary
Template class that represents a recursively divisible three-dimensional half-open interval.

Syntax
template<typename PageValue, typename RowValue=PageValue, typename ColValue=RowValue> class blocked_range3d;

Header
#include "tbb/blocked_range3d.h"

Description
A blocked_range3d is the three-dimensional extension of blocked_range2d.

Members
namespace tbb {
    template<typename PageValue, typename RowValue=PageValue, typename ColValue=RowValue>
    class blocked_range3d {
    public:
        // Types
        typedef blocked_range<PageValue> page_range_type;
        typedef blocked_range<RowValue> row_range_type;
        typedef blocked_range<ColValue> col_range_type;

        // Constructors
        blocked_range3d( PageValue page_begin, PageValue page_end,
                         typename page_range_type::size_type page_grainsize,
                         RowValue row_begin, RowValue row_end,
                         typename row_range_type::size_type row_grainsize,
                         ColValue col_begin, ColValue col_end,
                         typename col_range_type::size_type col_grainsize );
        blocked_range3d( PageValue page_begin, PageValue page_end,
                         RowValue row_begin, RowValue row_end,
                         ColValue col_begin, ColValue col_end );
        blocked_range3d( blocked_range3d& r, split );

        // Capacity
        bool empty() const;

        // Access
        bool is_divisible() const;
        const page_range_type& pages() const;
        const row_range_type& rows() const;
        const col_range_type& cols() const;
    };
}

4.3 Partitioners

Summary
A partitioner specifies how a loop template should partition its work among threads.

Description
The default behavior of the loop templates parallel_for (4.4), parallel_reduce (4.5), and parallel_scan (4.6) tries to recursively split a range into enough parts to keep processors busy, not necessarily splitting as finely as possible. An optional partitioner parameter enables other behaviors to be specified, as shown in Table 10. The first column of the table shows how the formal parameter is declared in the loop templates. An affinity_partitioner is passed by non-const reference because it is updated to remember where loop iterations run.
Table 10: Partitioners

Partitioner: const auto_partitioner& (default; see footnote 4)
Loop Behavior: Performs sufficient splitting to balance load, not necessarily splitting as finely as Range::is_divisible permits. When used with classes such as blocked_range, the selection of an appropriate grainsize is less important, and often acceptable performance can be achieved with the default grain size of 1.

Partitioner: affinity_partitioner&
Loop Behavior: Similar to auto_partitioner, but improves cache affinity by its choice of mapping subranges to worker threads. It can improve performance significantly when a loop is re-executed over the same data set, and the data set fits in cache.

Partitioner: const simple_partitioner&
Loop Behavior: Recursively splits a range until it is no longer divisible. The Range::is_divisible function is wholly responsible for deciding when recursive splitting halts. When used with classes such as blocked_range, the selection of an appropriate grainsize is critical to enabling concurrency while limiting overheads (see the discussion in Section 4.2.1).

Footnote 4: In Intel® TBB 2.1, simple_partitioner was the default. Intel® TBB 2.2 changed the default to auto_partitioner to simplify common usage of the loop templates. To get the old default, compile with the preprocessor symbol TBB_DEPRECATED=1.

4.3.1 auto_partitioner Class

Summary
Specify that a parallel loop should optimize its range subdivision based on work-stealing events.

Syntax
class auto_partitioner;

Header
#include "tbb/partitioner.h"

Description
A loop template with an auto_partitioner attempts to minimize range splitting while providing ample opportunities for work-stealing. The range subdivision is initially limited to S subranges, where S is proportional to the number of threads specified by the task_scheduler_init (12.2.1). Each of these subranges is not divided further unless it is stolen by an idle thread. If stolen, it is further subdivided to create additional subranges.
Thus a loop template with an auto_partitioner creates additional subranges only when necessary to balance load.

TIP: When using auto_partitioner and a blocked_range for a parallel loop, the body may be passed a subrange larger than the blocked_range's grainsize. Therefore do not assume that the grainsize is an upper bound on the size of the subrange. Use a simple_partitioner if an upper bound is required.

Members
namespace tbb {
    class auto_partitioner {
    public:
        auto_partitioner();
        ~auto_partitioner();
    };
}

4.3.1.1 auto_partitioner()
Construct an auto_partitioner.

4.3.1.2 ~auto_partitioner()
Destroy this auto_partitioner.

4.3.2 affinity_partitioner

Summary
Hint that loop iterations should be assigned to threads in a way that optimizes for cache affinity.

Syntax
class affinity_partitioner;

Header
#include "tbb/partitioner.h"

Description
An affinity_partitioner hints that execution of a loop template should assign iterations to the same processors as another execution of the loop (or another loop) with the same affinity_partitioner object. Unlike the other partitioners, it is important that the same affinity_partitioner object be passed to the loop templates to be optimized for affinity. The Tutorial (Section 3.2.3 "Bandwidth and Cache Affinity") discusses affinity effects in detail.

TIP: The affinity_partitioner generally improves performance only when:
• The computation does a few operations per data access.
• The data acted upon by the loop fits in cache.
• The loop, or a similar loop, is re-executed over the same data.
• There are more than two hardware threads available.

Members
namespace tbb {
    class affinity_partitioner {
    public:
        affinity_partitioner();
        ~affinity_partitioner();
    };
}

Example
The following example can benefit from cache affinity. The example simulates a one-dimensional additive automaton.
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    #include "tbb/partitioner.h"

    using namespace tbb;

    const int N = 1000000;
    typedef unsigned char Cell;
    Cell Array[2][N];
    int FlipFlop;

    struct TimeStepOverSubrange {
        void operator()( const blocked_range<int>& r ) const {
            int j = r.end();
            const Cell* x = Array[FlipFlop];
            Cell* y = Array[!FlipFlop];
            for( int i=r.begin(); i!=j; ++i )
                y[i] = x[i]^x[i+1];
        }
    };

    void DoAllTimeSteps( int m ) {
        affinity_partitioner ap;
        for( int k=0; k<m; ++k ) {
            parallel_for( blocked_range<int>(0,N-1), TimeStepOverSubrange(), ap );
            FlipFlop ^= 1;
        }
    }

For each time step, the old state of the automaton is read from Array[FlipFlop], and the new state is written into Array[!FlipFlop]. Then FlipFlop flips to make the new state become the old state. The aggregate size of both states is about 2 MByte, which fits in most modern processors' cache. Improvements ranging from 50% to 200% have been observed for this example on 8-core machines, compared with using an auto_partitioner instead.

The affinity_partitioner must live between loop iterations. The example accomplishes this by declaring it outside the loop that executes all iterations. An alternative would be to declare the affinity partitioner at file scope, which works as long as DoAllTimeSteps itself is not invoked concurrently. The same instance of affinity_partitioner should not be passed to two parallel algorithm templates that are invoked concurrently. Use separate instances instead.

4.3.2.1 affinity_partitioner()
Construct an affinity_partitioner.

4.3.2.2 ~affinity_partitioner()
Destroy this affinity_partitioner.

4.3.3 simple_partitioner Class

Summary
Specify that a parallel loop should recursively split its range until it cannot be subdivided further.

Syntax
class simple_partitioner;

Header
#include "tbb/partitioner.h"

Description
A simple_partitioner specifies that a loop template should recursively divide its range until for each subrange r, the condition !r.is_divisible() holds.
This is the default behavior of the loop templates that take a range argument. TIP: When using simple_partitioner and a blocked_range for a parallel loop, be careful to specify an appropriate grainsize for the blocked_range. The default grainsize is 1, which may make the subranges much too small for efficient execution. Members namespace tbb { class simple_partitioner { public: simple_partitioner(); ~simple_partitioner(); } } 4.3.3.1 simple_partitioner() Construct a simple_partitioner. 4.3.3.2 ~simple_partitioner() Destroy this simple_partitioner. 4.4 parallel_for Template Function Summary Template function that performs parallel iteration over a range of values. Syntax template Func parallel_for( Index first, Index_type last, const Func& f [, task_group_context& group] ); template 32 315415-014US Func parallel_for( Index first, Index_type last, Index step, const Func& f [, task_group_context& group] ); template void parallel_for( const Range& range, const Body& body, [, partitioner[, task_group_context& group]] ); where the optional partitioner declares any of the partitioners as shown in column 1 of Table 10. 923H Header #include "tbb/parallel_for.h" Description A parallel_for(first,last,step,f) represents parallel execution of the loop: for( auto i=first; i& range ) const { for( int i=range.begin(); i!=range.end(); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.f); } }; // Note: Reads input[0..n] and writes output[1..n-1]. void ParallelAverage( float* output, const float* input, size_t n ) { Average avg; avg.input = input; 34 315415-014US avg.output = output; parallel_for( blocked_range( 1, n ), avg ); } Example This example is more complex and requires familiarity with STL. It shows the power of parallel_for beyond flat iteration spaces. The code performs a parallel merge of two sorted sequences. It works for any sequence with a random-access iterator. The algorithm (Akl 1987) works recursively as follows: 1. 
If the sequences are too short for effective use of parallelism, do a sequential merge. Otherwise perform steps 2-6. 2. Swap the sequences if necessary, so that the first sequence [begin1,end1) is at least as long as the second sequence [begin2,end2). 3. Set m1 to the middle position in [begin1,end1). Call the item at that location key. 4. Set m2 to where key would fall in [begin2,end2). 5. Merge [begin1,m1) and [begin2,m2) to create the first part of the merged sequence. 6. Merge [m1,end1) and [m2,end2) to create the second part of the merged sequence. The Intel® Threading Building Blocks implementation of this algorithm uses the range object to perform most of the steps. Predicate is_divisible performs the test in step 1, and step 2. The splitting constructor does steps 3-6. The body object does the sequential merges. #include "tbb/parallel_for.h" #include using namespace tbb; template struct ParallelMergeRange { static size_t grainsize; Iterator begin1, end1; // [begin1,end1) is 1st sequence to be merged Iterator begin2, end2; // [begin2,end2) is 2nd sequence to be merged Iterator out; // where to put merged sequence bool empty() const {return (end1-begin1)+(end2-begin2)==0;} bool is_divisible() const { return std::min( end1-begin1, end2-begin2 ) > grainsize; } ParallelMergeRange( ParallelMergeRange& r, split ) { if( r.end1-r.begin1 < r.end2-r.begin2 ) { std::swap(r.begin1,r.begin2); Algorithms Reference Manual 35 std::swap(r.end1,r.end2); } Iterator m1 = r.begin1 + (r.end1-r.begin1)/2; Iterator m2 = std::lower_bound( r.begin2, r.end2, *m1 ); begin1 = m1; begin2 = m2; end1 = r.end1; end2 = r.end2; out = r.out + (m1-r.begin1) + (m2-r.begin2); r.end1 = m1; r.end2 = m2; } ParallelMergeRange( Iterator begin1_, Iterator end1_, Iterator begin2_, Iterator end2_, Iterator out_ ) : begin1(begin1_), end1(end1_), begin2(begin2_), end2(end2_), out(out_) {} }; template size_t ParallelMergeRange::grainsize = 1000; template struct ParallelMergeBody { void operator()( 
ParallelMergeRange<Iterator>& r ) const {
            std::merge( r.begin1, r.end1, r.begin2, r.end2, r.out );
        }
    };

    template<typename Iterator>
    void ParallelMerge( Iterator begin1, Iterator end1, Iterator begin2, Iterator end2, Iterator out ) {
        parallel_for( ParallelMergeRange<Iterator>(begin1,end1,begin2,end2,out),
                      ParallelMergeBody<Iterator>(),
                      simple_partitioner() );
    }

Because the algorithm moves many locations, it tends to be bandwidth limited. Speedup varies, depending upon the system.

4.5 parallel_reduce Template Function

Summary
Computes reduction over a range.

Syntax
template<typename Range, typename Value, typename Func, typename Reduction>
Value parallel_reduce( const Range& range, const Value& identity, const Func& func, const Reduction& reduction [, partitioner[, task_group_context& group]] );

template<typename Range, typename Body>
void parallel_reduce( const Range& range, const Body& body [, partitioner[, task_group_context& group]] );

where the optional partitioner declares any of the partitioners as shown in column 1 of Table 10.

Header
#include "tbb/parallel_reduce.h"

Description
The parallel_reduce template has two forms. The functional form is designed to be easy to use in conjunction with lambda expressions. The imperative form is designed to minimize copying of data.

The functional form parallel_reduce(range,identity,func,reduction) performs a parallel reduction by applying func to subranges in range and reducing the results using the binary operator reduction. It returns the result of the reduction. Parameters func and reduction can be lambda expressions. Table 12 summarizes the requirements on the types of identity, func, and reduction.

Table 12: Requirements for Func and Reduction

Pseudo-Signature: Value Identity;
Semantics: Left identity element for Func::operator().

Pseudo-Signature: Value Func::operator()(const Range& range, const Value& x)
Semantics: Accumulate result for subrange, starting with initial value x.

Pseudo-Signature: Value Reduction::operator()(const Value& x, const Value& y);
Semantics: Combine results x and y.
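The roles of identity, func, and reduction in Table 12 can be illustrated with a serial model in plain C++. This is only a sketch under stated assumptions: model_reduce and chunked_sum are hypothetical names, a pair of indices stands in for the Range argument, and fixed-size chunks stand in for whatever subranges the library would actually choose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical serial model of the functional form: apply func to each chunk,
// starting from identity, then fold the partial results with reduction.
template <typename Value, typename Func, typename Reduction>
Value model_reduce(const std::vector<Value>& data, std::size_t chunk,
                   const Value& identity, Func func, Reduction reduction) {
    assert(chunk > 0);
    Value result = identity;
    for (std::size_t lo = 0; lo < data.size(); lo += chunk) {
        std::size_t hi = std::min(lo + chunk, data.size());
        Value partial = func(lo, hi, identity);  // accumulate one subrange
        result = reduction(result, partial);     // combine, left to right
    }
    return result;
}

// Sum a vector by chunks; the chunk size must not change the answer.
int chunked_sum(const std::vector<int>& v, std::size_t chunk) {
    return model_reduce<int>(
        v, chunk, 0,
        [&v](std::size_t lo, std::size_t hi, int init) {
            return std::accumulate(v.begin() + lo, v.begin() + hi, init);
        },
        std::plus<int>());
}
```

Because each partial result starts from identity, the value passed as identity must really be an identity for the operation; otherwise the answer would depend on how many chunks were formed.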
The imperative form parallel_reduce(range,body) performs parallel reduction of body over each value in range. Type Range must model the Range concept (4.2). The body must model the requirements in Table 13.

Table 13: Requirements for parallel_reduce Body

Pseudo-Signature: Body::Body( Body&, split );
Semantics: Splitting constructor (4.1). Must be able to run concurrently with operator() and method join.

Pseudo-Signature: Body::~Body()
Semantics: Destructor.

Pseudo-Signature: void Body::operator()(const Range& range);
Semantics: Accumulate result for subrange.

Pseudo-Signature: void Body::join( Body& rhs );
Semantics: Join results. The result in rhs should be merged into the result of this.

A parallel_reduce recursively splits the range into subranges to the point such that is_divisible() is false for each subrange. A parallel_reduce uses the splitting constructor to make one or more copies of the body for each thread. It may copy a body while the body's operator() or method join runs concurrently. You are responsible for ensuring the safety of such concurrency. In typical usage, the safety requires no extra effort.

When worker threads are available (12.2.1), parallel_reduce invokes the splitting constructor for the body. For each such split of the body, it invokes method join in order to merge the results from the bodies. Define join to update this to represent the accumulated result for this and rhs. The reduction operation should be associative, but does not have to be commutative. For a noncommutative operation op, "left.join(right)" should update left to be the result of left op right.

A body is split only if the range is split, but the converse is not necessarily so. Figure 1 diagrams a sample execution of parallel_reduce. The root represents the original body b0 being applied to the half-open interval [0,20). The range is recursively split at each level into two subranges. The grain size for the example is 5, which yields four leaf ranges.
The slash marks (/) denote where copies (b1 and b2) of the body were created by the body splitting constructor. Bodies b0 and b1 each evaluate one leaf. Body b2 evaluates leaf [10,15) and [15,20), in that order. On the way back up the tree, parallel_reduce invokes b0.join(b1) and b0.join(b2) to merge the results of the leaves.

    b0 [0,20)
        b0 [0,10)
            b0 [0,5)
          / b1 [5,10)
      / b2 [10,20)
            b2 [10,15)
            b2 [15,20)

Figure 1: Execution of parallel_reduce over blocked_range<int>(0,20,5)

Figure 1 shows only one possible execution. Other valid executions include splitting b2 into b2 and b3, or doing no splitting at all. With no splitting, b0 evaluates each leaf in left to right order, with no calls to join. A given body always evaluates one or more subranges in left to right order. For example, in Figure 1, body b2 is guaranteed to evaluate [10,15) before [15,20). You may rely on the left to right property for a given instance of a body. However, you must neither rely on a particular choice of body splitting nor on the subranges processed by a given body object being consecutive. parallel_reduce makes the choice of body splitting nondeterministically.

    b0 [0,20)
        b0 [0,10)
            b0 [0,5)
          / b1 [5,10)
        b0 [10,20)
            b0 [10,15)
            b0 [15,20)

Figure 2: Example Where Body b0 Processes Non-consecutive Subranges.

The subranges evaluated by a given body are not consecutive if there is an intervening join. The joined information represents processing of a gap between evaluated subranges. Figure 2 shows such an example. The body b0 performs the following sequence of operations:

    b0( [0,5) )
    b0.join( b1 ) where b1 has already processed [5,10)
    b0( [10,15) )
    b0( [15,20) )

In other words, body b0 gathers information about all the leaf subranges in left to right order, either by directly processing each leaf, or by a join operation on a body that gathered information about one or more leaves in a similar way.
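The ordering guarantee matters most for noncommutative operations. The sketch below replays the execution of Figure 1 by hand, in plain C++ and without threads, using string concatenation as the operation. ConcatBody and replay_figure1 are hypothetical names introduced for illustration; the index pair passed to operator() stands in for a range argument, and the input is assumed to hold at least 20 characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical body that models a noncommutative reduction: concatenation.
struct ConcatBody {
    std::string value;
    // Process the half-open index interval [lo,hi) of text.
    void operator()(const std::string& text, std::size_t lo, std::size_t hi) {
        value += text.substr(lo, hi - lo);
    }
    // left.join(right) must leave left holding "left op right".
    void join(const ConcatBody& rhs) { value += rhs.value; }
};

// Replays the execution diagrammed in Figure 1 by hand, without threads.
std::string replay_figure1(const std::string& text) {
    ConcatBody b0, b1, b2;   // b1 and b2 stand for the split-off copies
    b0(text, 0, 5);
    b1(text, 5, 10);
    b2(text, 10, 15);
    b2(text, 15, 20);        // same body, subranges in left to right order
    b0.join(b1);
    b0.join(b2);
    return b0.value;
}
```

Because every join appends the right-hand result to the left-hand one, the replay reproduces the first 20 characters of the input in their original order, which is the result a serial left to right pass would give.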
When no worker threads are available, parallel_reduce executes sequentially from left to right in the same sense as for parallel_for (4.4). Sequential execution never invokes the splitting constructor or method join.

All overloads can be passed a task_group_context object so that the algorithm's tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

Complexity
If the range and body take O(1) space, and the range splits into nearly equal pieces, then the space complexity is O(P log(N)), where N is the size of the range and P is the number of threads.

Example (Imperative Form)
The following code sums the values in an array.

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    struct Sum {
        float value;
        Sum() : value(0) {}
        Sum( Sum& s, split ) {value = 0;}
        void operator()( const blocked_range<float*>& r ) {
            float temp = value;
            for( float* a=r.begin(); a!=r.end(); ++a ) {
                temp += *a;
            }
            value = temp;
        }
        void join( Sum& rhs ) {value += rhs.value;}
    };

    float ParallelSum( float array[], size_t n ) {
        Sum total;
        parallel_reduce( blocked_range<float*>( array, array+n ), total );
        return total.value;
    }

The example generalizes to reduction for any associative operation op as follows:
• Replace occurrences of 0 with the identity element for op.
• Replace occurrences of += with op= or its logical equivalent.
• Change the name Sum to something more appropriate for op.

The operation may be noncommutative. For example, op could be matrix multiplication.

Example with Lambda Expressions
The following is analogous to the previous example, but written using lambda expressions and the functional form of parallel_reduce.
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    float ParallelSum( float array[], size_t n ) {
        return parallel_reduce(
            blocked_range<float*>( array, array+n ),
            0.f,
            [](const blocked_range<float*>& r, float init)->float {
                for( float* a=r.begin(); a!=r.end(); ++a )
                    init += *a;
                return init;
            },
            []( float x, float y )->float {
                return x+y;
            }
        );
    }

STL generalized numeric operations and function objects can be used to write the example more compactly as follows:

    #include <numeric>
    #include <functional>
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    float ParallelSum( float array[], size_t n ) {
        return parallel_reduce(
            blocked_range<float*>( array, array+n ),
            0.f,
            [](const blocked_range<float*>& r, float value)->float {
                return std::accumulate(r.begin(),r.end(),value);
            },
            std::plus<float>()
        );
    }

4.6 parallel_scan Template Function

Summary
Template function that computes parallel prefix.

Syntax
template<typename Range, typename Body>
void parallel_scan( const Range& range, Body& body );

template<typename Range, typename Body>
void parallel_scan( const Range& range, Body& body, const auto_partitioner& );

template<typename Range, typename Body>
void parallel_scan( const Range& range, Body& body, const simple_partitioner& );

Header
#include "tbb/parallel_scan.h"

Description
A parallel_scan(range,body) computes a parallel prefix, also known as parallel scan. This computation is an advanced concept in parallel computing that is sometimes useful in scenarios that appear to have inherently serial dependences.

A mathematical definition of the parallel prefix is as follows. Let ⊕ be an associative operation with left-identity element id⊕. The parallel prefix of ⊕ over a sequence x0, x1, ...xn-1 is a sequence y0, y1, y2, ...yn-1 where:
• y0 = id⊕ ⊕ x0
• yi = yi-1 ⊕ xi

For example, if ⊕ is addition, the parallel prefix corresponds to a running sum. A serial implementation of parallel prefix is:

    T temp = id⊕;
    for( int i=0; i<n; ++i ) {
        temp = temp ⊕ x[i];
        y[i] = temp;
    }

Parallel prefix performs this in parallel by reassociating the application of ⊕ and using two passes. It may invoke ⊕ up to twice as many times as the serial prefix algorithm. Given the right grain size and sufficient hardware threads, it can outperform the serial prefix because even though it does more work, it can distribute the work across more than one hardware thread.

TIP: Because parallel_scan needs two passes, systems with only two hardware threads tend to exhibit small speedup. parallel_scan is best considered a glimpse of a technique for future systems with more than two cores. It is nonetheless of interest because it shows how a problem that appears inherently sequential can be parallelized.

The template parallel_scan implements parallel prefix generically. It requires the signatures described in Table 14.

Table 14: parallel_scan Requirements

Pseudo-Signature: void Body::operator()( const Range& r, pre_scan_tag )
Semantics: Accumulate summary for range r.

Pseudo-Signature: void Body::operator()( const Range& r, final_scan_tag )
Semantics: Compute scan result and summary for range r.

Pseudo-Signature: Body::Body( Body& b, split )
Semantics: Split b so that this and b can accumulate summaries separately. Body *this is object a in the table row below.

Pseudo-Signature: void Body::reverse_join( Body& a )
Semantics: Merge summary accumulated by a into summary accumulated by this, where this was created earlier from a by a's splitting constructor. Body *this is object b in the table row above.

Pseudo-Signature: void Body::assign( Body& b )
Semantics: Assign summary of b to this.

A summary contains enough information such that for two consecutive subranges r and s:
• If r has no preceding subrange, the scan result for s can be computed from knowing s and the summary for r.
• A summary of r concatenated with s can be computed from the summaries of r and s.

For example, if computing a running sum of an array, the summary for a range r is the sum of the array elements corresponding to r.
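The two properties of a summary are what make a two-pass scheme possible. The following serial model in plain C++ is a sketch of the idea for a running sum, not the library's algorithm: two_pass_running_sum is a hypothetical name, and fixed-size chunks are an assumption standing in for the subranges that a real execution would choose dynamically.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial model of a two-pass running sum over fixed-size chunks.
std::vector<int> two_pass_running_sum(const std::vector<int>& x, std::size_t chunk) {
    assert(chunk > 0);
    const std::size_t n = x.size();
    const std::size_t chunks = (n + chunk - 1) / chunk;

    // Pass 1 (pre-scan): the summary of each chunk is the sum of its elements.
    std::vector<int> summary(chunks, 0);
    for (std::size_t c = 0; c < chunks; ++c)
        for (std::size_t i = c * chunk; i < std::min(n, (c + 1) * chunk); ++i)
            summary[c] += x[i];

    // Combine: carry[c] is the summary of everything to the left of chunk c.
    std::vector<int> carry(chunks, 0);
    for (std::size_t c = 1; c < chunks; ++c)
        carry[c] = carry[c - 1] + summary[c - 1];

    // Pass 2 (final scan): each chunk needs only its own data and its carry.
    std::vector<int> y(n);
    for (std::size_t c = 0; c < chunks; ++c) {
        int temp = carry[c];
        for (std::size_t i = c * chunk; i < std::min(n, (c + 1) * chunk); ++i) {
            temp += x[i];
            y[i] = temp;
        }
    }
    return y;
}
```

In both passes the chunks are independent of one another, so each pass could be distributed across threads; the price is that every element is added twice, which matches the remark that the operation may be invoked up to twice as many times as in the serial algorithm.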
Figure 3 shows one way that parallel_scan might compute the running sum of an array containing the integers 1-16. Time flows downwards in the diagram. Each color denotes a separate Body object. Summaries are shown in brackets.

1. The first two steps split the original blue body into the pink and yellow bodies. Each body operates on a quarter of the input array in parallel. The last quarter is processed later in step 5.
2. The blue body computes the final scan and summary for 1-4. The pink and yellow bodies compute their summaries by prescanning 5-8 and 9-12 respectively.
3. The pink body computes its summary for 1-8 by performing a reverse_join with the blue body.
4. The yellow body computes its summary for 1-12 by performing a reverse_join with the pink body.
5. The blue, pink, and yellow bodies compute final scans and summaries for portions of the array.
6. The yellow summary is assigned to the blue body. The pink and yellow bodies are destroyed.

Note that two quarters of the array were not prescanned. The parallel_scan template makes an effort to avoid prescanning where possible, to improve performance when there are only a few or no extra worker threads. If no other workers are available, parallel_scan processes the subranges without any prescans, by processing the subranges from left to right using final scans. That is why final scans must compute a summary as well as the final scan result. The summary might be needed to process the next subrange if no worker thread has prescanned it yet.
Figure 3: Example Execution of parallel_scan (diagram of the split, pre_scan, reverse_join, final_scan, and assign operations applied to the input array 1-16, with summaries such as [10], [26], [36], [42], [78], and [136] shown in brackets)

The following code demonstrates how the signatures could be implemented to use parallel_scan to compute the same result as the earlier sequential example involving ⊕.

    using namespace tbb;

    class Body {
        T sum;
        T* const y;
        const T* const x;
    public:
        Body( T y_[], const T x_[] ) : sum(id⊕), y(y_), x(x_) {}
        T get_sum() const {return sum;}

        template<typename Tag>
        void operator()( const blocked_range<int>& r, Tag ) {
            T temp = sum;
            for( int i=r.begin(); i<r.end(); ++i ) {
                temp = temp ⊕ x[i];
                if( Tag::is_final_scan() )
                    y[i] = temp;
            }
            sum = temp;
        }
        Body( Body& b, split ) : sum(id⊕), y(b.y), x(b.x) {}
        void reverse_join( Body& a ) { sum = a.sum ⊕ sum; }
        void assign( Body& b ) { sum = b.sum; }
    };

    T DoParallelScan( T y[], const T x[], int n ) {
        Body body(y,x);
        parallel_scan( blocked_range<int>(0,n), body );
        return body.get_sum();
    }

The definition of operator() demonstrates typical patterns when using parallel_scan.
• A single template defines both versions. Doing so is not required, but usually saves coding effort, because the two versions are usually similar. The library defines static method is_final_scan() to enable differentiation between the versions.
• The prescan variant computes the ⊕ reduction, but does not update y. The prescan is used by parallel_scan to generate look-ahead partial reductions.
• The final scan variant computes the ⊕ reduction and updates y.

The operation reverse_join is similar to the operation join used by parallel_reduce, except that the arguments are reversed. That is, this is the right argument of ⊕. Template function parallel_scan decides if and when to generate parallel work. It is thus crucial that ⊕ is associative and that the methods of Body faithfully represent it. Operations such as floating-point addition that are only approximately associative can be used, with the understanding that the results may be rounded differently depending upon the association used by parallel_scan.
The reassociation may differ between runs even on the same machine. However, if there are no worker threads available, execution associates identically to the serial form shown at the beginning of this section.

If you change the example to use a simple_partitioner, be sure to provide a grainsize. The code below shows how to do this for a grainsize of 1000:

    parallel_scan( blocked_range<int>(0,n,1000), total, simple_partitioner() );

4.6.1 pre_scan_tag and final_scan_tag Classes

Summary
Types that distinguish the phases of parallel_scan.

Syntax
struct pre_scan_tag;
struct final_scan_tag;

Header
#include "tbb/parallel_scan.h"

Description
Types pre_scan_tag and final_scan_tag are dummy types used in conjunction with parallel_scan. See the example in Section 4.6 for how they are used in the signature of operator().

Members
namespace tbb {
    struct pre_scan_tag {
        static bool is_final_scan();
    };
    struct final_scan_tag {
        static bool is_final_scan();
    };
}

4.6.1.1 bool is_final_scan()

Returns
True for a final_scan_tag, otherwise false.

4.7 parallel_do Template Function

Summary
Template function that processes work items in parallel.

Syntax
template<typename InputIterator, typename Body>
void parallel_do( InputIterator first, InputIterator last, Body body[, task_group_context& group] );

Header
#include "tbb/parallel_do.h"

Description
A parallel_do(first,last,body) applies a function object body over the half-open interval [first,last). Items may be processed in parallel. Additional work items can be added by body if it has a second argument of type parallel_do_feeder (4.7.1). The function terminates when body(x) returns for all items x that were in the input sequence or added to it by method parallel_do_feeder::add (4.7.1.1).

The requirements for input iterators are specified in Section 24.1 of the ISO C++ standard. Table 15 shows the requirements on type Body.
Table 15: parallel_do Requirements for Body B and its Argument Type T

Pseudo-Signature: B::operator()( cv-qualifiers T& item, parallel_do_feeder<T>& feeder ) const
    OR B::operator()( cv-qualifiers T& item ) const
Semantics: Process item. Template parallel_do may concurrently invoke operator() for the same this but different item. The signature with feeder permits additional work items to be added.

Pseudo-Signature: T( const T& )
Semantics: Copy a work item.

Pseudo-Signature: T::~T()
Semantics: Destroy a work item.

For example, a unary function object, as defined in Section 20.3 of the C++ standard, models the requirements for B.

CAUTION: Defining both the one-argument and two-argument forms of operator() is not permitted.

TIP: The parallelism in parallel_do is not scalable if all of the items come from an input stream that does not have random access. To achieve scaling, do one of the following:

• Use random access iterators to specify the input stream.
• Design your algorithm such that the body often adds more than one piece of work.
• Use parallel_for instead.

To achieve speedup, the grainsize of B::operator() needs to be on the order of at least ~100,000 clock cycles. Otherwise, the internal overheads of parallel_do swamp the useful work.

The algorithm can be passed a task_group_context object so that its tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

Example
The following code sketches a body with the two-argument form of operator().

    struct MyBody {
        void operator()(item_t item, parallel_do_feeder<item_t>& feeder ) {
            for each new piece of work implied by item do {
                item_t new_item = initializer;
                feeder.add(new_item);
            }
        }
    };

4.7.1 parallel_do_feeder class

Summary
Inlet into which additional work items for a parallel_do can be fed.

Syntax
    template<typename Item>
    class parallel_do_feeder;

Header
    #include "tbb/parallel_do.h"

Description
A parallel_do_feeder enables the body of a parallel_do to add more work items.
Only class parallel_do (4.7) can create or destroy a parallel_do_feeder. The only operation other code can perform on a parallel_do_feeder is to invoke method parallel_do_feeder::add.

Members
    namespace tbb {
        template<typename Item>
        struct parallel_do_feeder {
            void add( const Item& item );
        };
    }

4.7.1.1 void add( const Item& item )

Requirements
Must be called from a call to body.operator() created by parallel_do. Otherwise, the termination semantics of method operator() are undefined.

Effects
Adds item to collection of work items to be processed.

4.8 parallel_for_each Template Function

Summary
Parallel variant of std::for_each.

Syntax
    template<typename InputIterator, typename Func>
    void parallel_for_each( InputIterator first, InputIterator last,
                            const Func& f [, task_group_context& group] );

Header
    #include "tbb/parallel_for_each.h"

Description
A parallel_for_each(first,last,f) applies f to the result of dereferencing every iterator in the range [first,last), possibly in parallel. It is provided for PPL compatibility and equivalent to parallel_do(first,last,f) without "feeder" functionality.

If the group argument is specified, the algorithm’s tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

4.9 pipeline Class

Summary
Class that performs pipelined execution.

Syntax
    class pipeline;

Header
    #include "tbb/pipeline.h"

Description
A pipeline represents pipelined application of a series of filters to a stream of items. Each filter operates in a particular mode: parallel, serial in order, or serial out of order (MacDonald 2004). See class filter (4.9.6) for details.

A pipeline contains one or more filters, denoted here as fi, where i denotes the position of the filter in the pipeline. The pipeline starts with filter f0, followed by f1, f2, etc. The following steps describe how to use class pipeline.

13. Derive each class fi from filter.
The constructor for fi specifies its mode as a parameter to the constructor for base class filter (4.9.6.1).
14. Override virtual method filter::operator() to perform the filter’s action on the item, and return a pointer to the item to be processed by the next filter. The first filter f0 generates the stream. It should return NULL if there are no more items in the stream. The return value for the last filter is ignored.
15. Create an instance of class pipeline.
16. Create instances of the filters fi and add them to the pipeline, in order from first to last. An instance of a filter can be added at most once to a pipeline. A filter should never be a member of more than one pipeline at a time.
17. Call method pipeline::run. The parameter max_number_of_live_tokens puts an upper bound on the number of stages that will be run concurrently. Higher values may increase concurrency at the expense of more memory consumption from having more items in flight. See the Tutorial, in the section on class pipeline, for more about effective use of max_number_of_live_tokens.

TIP: Given sufficient processors and tokens, the throughput of the pipeline is limited to the throughput of the slowest serial filter.

NOTE: Function parallel_pipeline provides a strongly typed lambda-friendly way to build and run pipelines.

Members
    namespace tbb {
        class pipeline {
        public:
            pipeline();
            ~pipeline();    // see footnote 5
            void add_filter( filter& f );
            void run( size_t max_number_of_live_tokens
                      [, task_group_context& group] );
            void clear();
        };
    }

4.9.1 pipeline()

Effects
Constructs pipeline with no filters.

4.9.2 ~pipeline()

Effects
Removes all filters from the pipeline and destroys the pipeline.

4.9.3 void add_filter( filter& f )

Effects
Appends filter f to sequence of filters in the pipeline. The filter f must not already be in a pipeline.

Footnote 5: Though the current implementation declares the destructor virtual, do not rely on this detail.
The virtual nature is deprecated and may disappear in future versions of Intel® TBB.

4.9.4 void run( size_t max_number_of_live_tokens[, task_group_context& group] )

Effects
Runs the pipeline until the first filter returns NULL and each subsequent filter has processed all items from its predecessor. The number of items processed in parallel depends upon the structure of the pipeline and number of available threads. At most max_number_of_live_tokens are in flight at any given time.

A pipeline can be run multiple times. It is safe to add stages between runs. Concurrent invocations of run on the same instance of pipeline are prohibited.

If the group argument is specified, pipeline’s tasks are executed in this group. By default the algorithm is executed in a bound group of its own.

4.9.5 void clear()

Effects
Removes all filters from the pipeline.

4.9.6 filter Class

Summary
Abstract base class that represents a filter in a pipeline.

Syntax
    class filter;

Header
    #include "tbb/pipeline.h"

Description
A filter represents a filter in a pipeline (4.9). There are three modes of filters:

• A parallel filter can process multiple items in parallel and in no particular order.
• A serial_out_of_order filter processes items one at a time, and in no particular order.
• A serial_in_order filter processes items one at a time. All serial_in_order filters in a pipeline process items in the same order.

The mode of filter is specified by an argument to the constructor. Parallel filters are preferred when practical because they permit parallel speedup. If a filter must be serial, the out of order variant is preferred when practical because it puts fewer constraints on processing order.

Class filter should only be used in conjunction with class pipeline (4.9).

TIP: Use a serial_in_order input filter if there are any subsequent serial_in_order stages that should process items in their input order.
CAUTION: Intel® TBB 2.0 and prior treated parallel input stages as serial. Later versions of Intel® TBB can execute a parallel input stage in parallel, so if you specify such a stage, ensure that its operator() is thread safe.

Members
    namespace tbb {
        class filter {
        public:
            enum mode {
                parallel = implementation-defined,
                serial_in_order = implementation-defined,
                serial_out_of_order = implementation-defined
            };
            bool is_serial() const;
            bool is_ordered() const;
            virtual void* operator()( void* item ) = 0;
            virtual void finalize( void* item ) {}
            virtual ~filter();
        protected:
            filter( mode );
        };
    }

Example
See the example filters MyInputFilter, MyTransformFilter, and MyOutputFilter in the Tutorial (doc/Tutorial.pdf).

4.9.6.1 filter( mode filter_mode )

Effects
Constructs a filter of the specified mode.

NOTE: Intel® TBB 2.1 and prior had a similar constructor with a bool argument is_serial. That constructor exists but is deprecated (Section A.2.1).

4.9.6.2 ~filter()

Effects
Destroys the filter. If the filter is in a pipeline, it is automatically removed from that pipeline.

4.9.6.3 bool is_serial() const

Returns
False if filter mode is parallel; true otherwise.

4.9.6.4 bool is_ordered() const

Returns
True if filter mode is serial_in_order, false otherwise.

4.9.6.5 virtual void* operator()( void * item )

Description
The derived filter should override this method to process an item and return a pointer to an item to be processed by the next filter. The item parameter is NULL for the first filter in the pipeline.

Returns
The first filter in a pipeline should return NULL if there are no more items to process. The result of the last filter in a pipeline is ignored.

4.9.6.6 virtual void finalize( void * item )

Description
A pipeline can be cancelled by user demand or because of an exception. When a pipeline is cancelled, there may be items returned by a filter’s operator() that have not yet been processed by the next filter.
When a pipeline is cancelled, the next filter invokes finalize() on each item instead of operator(). In contrast to operator(), method finalize() does not return an item for further processing. A derived filter should override finalize() to perform proper cleanup for an item. A pipeline will not invoke any further methods on the item.

Effects
The default definition has no effect.

4.9.7 thread_bound_filter Class

Summary
Abstract base class that represents a filter in a pipeline that a thread must service explicitly.

Syntax
    class thread_bound_filter;

Header
    #include "tbb/pipeline.h"

Description
A thread_bound_filter is a special kind of filter (4.9.6) that is explicitly serviced by a particular thread. It is useful when a filter must be executed by a particular thread.

CAUTION: Use thread_bound_filter only if you need a filter to be executed on a particular thread. The thread that services a thread_bound_filter must not be the thread that calls pipeline::run().

Members
    namespace tbb {
        class thread_bound_filter: public filter {
        protected:
            thread_bound_filter(mode filter_mode);
        public:
            enum result_type {
                success,
                item_not_available,
                end_of_stream
            };
            result_type try_process_item();
            result_type process_item();
        };
    }

Example
The example below shows a pipeline with two filters where the second filter is a thread_bound_filter serviced by the main thread.
    #include <iostream>
    #include "tbb/pipeline.h"
    #include "tbb/compat/thread"
    #include "tbb/task_scheduler_init.h"

    using namespace tbb;

    char InputString[] = "abcdefg\n";

    class InputFilter: public filter {
        char* my_ptr;
    public:
        void* operator()(void*) {
            if (*my_ptr)
                return my_ptr++;
            else
                return NULL;
        }
        InputFilter() :
            filter( serial_in_order ), my_ptr(InputString) {}
    };

    class OutputFilter: public thread_bound_filter {
    public:
        void* operator()(void* item) {
            std::cout << *(char*)item;
            return NULL;
        }
        OutputFilter() : thread_bound_filter(serial_in_order) {}
    };

    void RunPipeline(pipeline* p) {
        p->run(8);
    }

    int main() {
        // Construct the pipeline
        InputFilter f;
        OutputFilter g;
        pipeline p;
        p.add_filter(f);
        p.add_filter(g);

        // Another thread initiates execution of the pipeline
        std::thread t(RunPipeline,&p);

        // Process the thread_bound_filter with the current thread.
        while (g.process_item()!=thread_bound_filter::end_of_stream)
            continue;

        // Wait for pipeline to finish on the other thread.
        t.join();

        return 0;
    }

The main thread does the following after constructing the pipeline:

18. Start the pipeline on another thread.
19. Service the thread_bound_filter until it reaches end_of_stream.
20. Wait for the other thread to finish.

The pipeline is run on a separate thread because the main thread is responsible for servicing the thread_bound_filter g. The roles of the two threads can be reversed. A single thread cannot do both roles.

4.9.7.1 thread_bound_filter(mode filter_mode)

Effects
Constructs a filter of the specified mode. Section 4.9.6 describes the modes.

4.9.7.2 result_type try_process_item()

Effects
If an item is available and it can be processed without exceeding the token limit, process the item with filter::operator().

Returns

Table 16: Return Values From try_process_item

    Return Value          Description
    success               Applied filter::operator() to one item.
    item_not_available    No item is currently available to process, or the token limit (4.9.4) would be exceeded.
    end_of_stream         No more items will ever arrive at this filter.

4.9.7.3 result_type process_item()

Effects
Like try_process_item, but waits until it can process an item or the end of the stream is reached.

Returns
Either success or end_of_stream. See Table 16 for details.

CAUTION: The current implementation spin waits until it can process an item or reaches the end of the stream.

4.10 parallel_pipeline Function

Summary
Strongly typed interface for pipelined execution.

Syntax
    void parallel_pipeline( size_t max_number_of_live_tokens,
                            const filter_t<void,void>& filter_chain
                            [, task_group_context& group] );

Header
    #include "tbb/pipeline.h"

Description
Function parallel_pipeline is a strongly typed lambda-friendly interface for building and running pipelines. The pipeline has characteristics similar to class pipeline, except that the stages of the pipeline are specified via functors instead of class derivation.

To build and run a pipeline from functors g0, g1, g2,...gn, write:

    parallel_pipeline( max_number_of_live_tokens,
                       make_filter<void,I1>(mode0,g0) &
                       make_filter<I1,I2>(mode1,g1) &
                       make_filter<I2,I3>(mode2,g2) &
                       ...
                       make_filter<In,void>(moden,gn) );

In general, functor gi should define its operator() to map objects of type Ii to objects of type Ii+1. Functor g0 is a special case, because it notifies the pipeline when the end of the input stream is reached. Functor g0 must be defined such that for a flow_control object fc, the expression g0(fc) either returns the next value in the input stream, or if at the end of the input stream, invokes fc.stop() and returns a dummy value.

The value max_number_of_live_tokens has the same meaning as it does for pipeline::run.

If the group argument is specified, pipeline’s tasks are executed in this group. By default the algorithm is executed in a bound group of its own.
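The contract for g0 and the chaining of later functors can be modeled sequentially. The sketch below uses no Intel TBB headers: flow_control_model and run_serial_model are hypothetical stand-ins that mimic only the stop() protocol described above, and they run every stage on one thread in order. It is meant to show how an input functor signals the end of the stream, not how parallel_pipeline is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::flow_control; it models only stop().
class flow_control_model {
    bool stopped_;
public:
    flow_control_model() : stopped_(false) {}
    void stop() { stopped_ = true; }
    bool is_stopped() const { return stopped_; }
};

// Hypothetical serial driver: calls g0 until it requests a stop, and
// passes each produced value through g1 and then g2.
template <typename G0, typename G1, typename G2>
void run_serial_model(G0 g0, G1 g1, G2 g2) {
    flow_control_model fc;
    for (;;) {
        auto item = g0(fc);
        if (fc.is_stopped()) break;  // the value returned with stop() is a dummy
        g2(g1(item));
    }
}

// Sum of squares of a sequence, arranged as three stages.
float sum_of_squares(const std::vector<float>& v) {
    std::size_t next = 0;
    float sum = 0;
    run_serial_model(
        // g0: input stage; returns the next element or stops the stream.
        [&](flow_control_model& fc) -> const float* {
            if (next < v.size()) return &v[next++];
            fc.stop();
            return nullptr;          // dummy value, never passed on
        },
        // g1: maps const float* to float.
        [](const float* p) { return (*p) * (*p); },
        // g2: last stage; consumes a float and produces nothing.
        [&](float x) { sum += x; });
    return sum;
}
```

The driver checks the stop flag before it forwards a value, which is why the input functor may return any placeholder after calling stop().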
Example
The following example uses parallel_pipeline to compute the root-mean-square of a sequence defined by [first,last). The example is only for demonstrating syntactic mechanics. It is not a practical way to do the calculation because parallel overhead would be vastly higher than useful work. Operator & requires that the output type of its first filter_t argument matches the input type of its second filter_t argument.

    float RootMeanSquare( float* first, float* last ) {
        float sum=0;
        parallel_pipeline( /*max_number_of_live_token=*/16,
            make_filter<void,float*>(
                filter::serial,
                [&](flow_control& fc)-> float*{
                    if( first<last ) {
                        return first++;
                    } else {
                        fc.stop();
                        return NULL;
                    }
                }
            ) &
            make_filter<float*,float>(
                filter::parallel,
                [](float* p){return (*p)*(*p);}
            ) &
            make_filter<float,void>(
                filter::serial,
                [&](float x) {sum+=x;}
            )
        );
        return sqrt(sum);
    }

See the Intel® Threading Building Blocks Tutorial for a non-trivial example of parallel_pipeline.

4.10.1 filter_t Template Class

Summary
A filter or composite filter used in conjunction with function parallel_pipeline.

Syntax
    template<typename T, typename U> class filter_t;

    template<typename T, typename U, typename Func>
    filter_t<T,U> make_filter( filter::mode mode, const Func& f );

    template<typename T, typename V, typename U>
    filter_t<T,U> operator&( const filter_t<T,V>& left,
                             const filter_t<V,U>& right );

Header
    #include "tbb/pipeline.h"

Description
A filter_t is a strongly typed filter that specifies its input and output types. A filter_t can be constructed from a functor or by composing two filter_t objects with operator&. See 4.10 for an example. The same filter_t object can be shared by multiple & expressions.

Members
    namespace tbb {
        template<typename T, typename U>
        class filter_t {
        public:
            filter_t();
            filter_t( const filter_t<T,U>& rhs );
            template<typename Func>
            filter_t( filter::mode mode, const Func& func );
            void operator=( const filter_t<T,U>& rhs );
            ~filter_t();
            void clear();
        };

        template<typename T, typename U, typename Func>
        filter_t<T,U> make_filter( filter::mode mode, const Func& f );

        template<typename T, typename V, typename U>
        filter_t<T,U> operator&( const filter_t<T,V>& left,
                                 const filter_t<V,U>& right );
    }

4.10.1.1 filter_t()

Effects
Construct an undefined filter.
CAUTION: The effect of using an undefined filter by operator& or parallel_pipeline is undefined.

4.10.1.2 filter_t( const filter_t<T,U>& rhs )

Effects
Construct a copy of rhs.

4.10.1.3 template<typename Func> filter_t( filter::mode mode, const Func& f )

Effects
Construct a filter_t that uses a copy of functor f to map an input value t of type T to an output value u of type U.

NOTE: When parallel_pipeline uses the filter_t, it computes u by evaluating f(t), unless T is void. In the void case u is computed by the expression u=f(fc), where fc is of type flow_control.

See 4.9.6 for a description of the mode argument.

4.10.1.4 void operator=( const filter_t<T,U>& rhs )

Effects
Update *this to use the functor associated with rhs.

4.10.1.5 ~filter_t()

Effects
Destroy the filter_t.

4.10.1.6 void clear()

Effects
Set *this to an undefined filter.

4.10.1.7 template<typename T, typename U, typename Func> filter_t<T,U> make_filter(filter::mode mode, const Func& f)

Returns
filter_t<T,U>(mode,f)

4.10.1.8 template<typename T, typename V, typename U> filter_t<T,U> operator& (const filter_t<T,V>& left, const filter_t<V,U>& right)

Requires
The output type of left must match the input type of right.

Returns
A filter_t representing the composition of filters left and right. The composition behaves as if the output value of left becomes the input value of right.

4.10.2 flow_control Class

Summary
Enables the first filter in a composite filter to indicate when the end of input has been reached.

Syntax
    class flow_control;

Header
    #include "tbb/pipeline.h"

Description
Template function parallel_pipeline passes a flow_control object fc to the input functor of a filter_t. When the input functor reaches the end of its input, it should invoke fc.stop() and return a dummy value. See 4.10 for an example.

Members
    namespace tbb {
        class flow_control {
        public:
            void stop();
        };
    }

4.11 parallel_sort Template Function

Summary
Sort a sequence.
Syntax
    template<typename RandomAccessIterator>
    void parallel_sort(RandomAccessIterator begin,
                       RandomAccessIterator end);

    template<typename RandomAccessIterator, typename Compare>
    void parallel_sort(RandomAccessIterator begin,
                       RandomAccessIterator end,
                       const Compare& comp );

Header
    #include "tbb/parallel_sort.h"

Description
Performs an unstable sort of sequence [begin, end). An unstable sort might not preserve the relative ordering of elements with equal keys. The sort is deterministic; sorting the same sequence will produce the same result each time. The requirements on the iterator and sequence are the same as for std::sort. Specifically, RandomAccessIterator must be a random access iterator, and its value type T must model the requirements in Table 17.

Table 17: Requirements on Value Type T of RandomAccessIterator for parallel_sort

Pseudo-Signature: void swap( T& x, T& y )
Semantics: Swap x and y.

Pseudo-Signature: bool Compare::operator()( const T& x, const T& y )
Semantics: True if x comes before y; false otherwise.

A call parallel_sort(i,j,comp) sorts the sequence [i,j) using the argument comp to determine relative orderings. If comp(x,y) returns true then x appears before y in the sorted sequence.

A call parallel_sort(i,j) is equivalent to parallel_sort(i,j,std::less<T>()).

Complexity
parallel_sort is a comparison sort with an average time complexity of O(N log (N)), where N is the number of elements in the sequence. When worker threads are available (12.2.1), parallel_sort creates subtasks that may be executed concurrently, leading to improved execution times.

Example
The following example shows two sorts. The sort of array a uses the default comparison, which sorts in ascending order. The sort of array b sorts in descending order by using std::greater<float> for comparison.
    #include "tbb/parallel_sort.h"
    #include <math.h>
    #include <functional>   // std::greater

    using namespace tbb;

    const int N = 100000;
    float a[N];
    float b[N];

    void SortExample() {
        // Fill both arrays with unsorted values.
        for( int i = 0; i < N; i++ ) {
            a[i] = sin((double)i);
            b[i] = cos((double)i);
        }
        parallel_sort(a, a + N);                         // ascending
        parallel_sort(b, b + N, std::greater<float>());  // descending
    }

4.12 parallel_invoke Template Function

Summary
Template function that evaluates several functions in parallel.

Syntax (see footnote 6)
    template<typename Func0, typename Func1>
    void parallel_invoke(const Func0& f0, const Func1& f1);

    template<typename Func0, typename Func1, typename Func2>
    void parallel_invoke(const Func0& f0, const Func1& f1, const Func2& f2);

    …

    template<typename Func0, typename Func1 … typename Func9>
    void parallel_invoke(const Func0& f0, const Func1& f1 … const Func9& f9);

Footnote 6: When support for C++0x rvalue references becomes prevalent, the formal parameters may change to rvalue references.

Header
    #include "tbb/parallel_invoke.h"

Description
The expression parallel_invoke(f0,f1...fk) evaluates f0(), f1(),...fk(), possibly in parallel. There can be from 2 to 10 arguments. Each argument must have a type for which operator() is defined. Typically the arguments are either function objects or pointers to functions. Return values are ignored.

Example
The following example evaluates f(), g(), and h() in parallel. Notice how g and h are function objects that can hold local state.

    #include "tbb/parallel_invoke.h"

    using namespace tbb;

    void f();
    extern void bar(int);

    class MyFunctor {
        int arg;
    public:
        MyFunctor(int a) : arg(a) {}
        void operator()() const {bar(arg);}
    };

    void RunFunctionsInParallel() {
        MyFunctor g(2);
        MyFunctor h(3);
        tbb::parallel_invoke(f, g, h );
    }

Example with Lambda Expressions
Here is the previous example rewritten with C++0x lambda expressions, which generate function objects.
    #include "tbb/parallel_invoke.h"

    using namespace tbb;

    void f();
    extern void bar(int);

    void RunFunctionsInParallel() {
        tbb::parallel_invoke(f, []{bar(2);}, []{bar(3);} );
    }

5 Containers

The container classes permit multiple threads to simultaneously invoke certain methods on the same container.

Like STL, Intel® Threading Building Blocks (Intel® TBB) containers are templated with respect to an allocator argument. Each container uses its allocator to allocate memory for user-visible items. A container may use a different allocator for strictly internal structures.

5.1 Container Range Concept

Summary
View set of items in a container as a recursively divisible range.

Requirements
A Container Range is a Range (4.2) with the further requirements listed in Table 18.

Table 18: Requirements on a Container Range R (In Addition to Table 8)

    Pseudo-Signature                     Semantics
    R::value_type                        Item type
    R::reference                         Item reference type
    R::const_reference                   Item const reference type
    R::difference_type                   Type for difference of two iterators
    R::iterator                          Iterator type for range
    R::iterator R::begin()               First item in range
    R::iterator R::end()                 One past last item in range
    R::size_type R::grainsize() const    Grain size

Model Types
Classes concurrent_hash_map (5.4.4) and concurrent_vector (5.8.5) both have member types range_type and const_range_type that model a Container Range.

Use the range types in conjunction with parallel_for (4.4), parallel_reduce (4.5), and parallel_scan (4.6) to iterate over items in a container.

5.2 concurrent_unordered_map Template Class

Summary
Template class for associative container that supports concurrent insertion and traversal.
Syntax
    template <typename Key,
              typename Element,
              typename Hasher = tbb_hash<Key>,
              typename Equality = std::equal_to<Key>,
              typename Allocator = tbb::tbb_allocator<std::pair<const Key, Element> > >
    class concurrent_unordered_map;

Header
    #include "tbb/concurrent_unordered_map.h"

Description
A concurrent_unordered_map supports concurrent insertion and traversal, but not concurrent erasure. The interface has no visible locking. It may hold locks internally, but never while calling user defined code. It has semantics similar to the C++0x std::unordered_map except as follows:

• Methods requiring C++0x language features (such as rvalue references and std::initializer_list) are currently omitted.
• The erase methods are prefixed with unsafe_, to indicate that they are not concurrency safe.
• Bucket methods are prefixed with unsafe_ as a reminder that they are not concurrency safe with respect to insertion.
• The insert methods may create a temporary pair that is destroyed if another thread inserts the same key concurrently.
• Like std::list, insertion of new items does not invalidate any iterators, nor change the order of items already in the map. Insertion and traversal may be concurrent.
• The iterator types iterator and const_iterator are of the forward iterator category.
• Insertion does not invalidate or update the iterators returned by equal_range, so insertion may cause non-equal items to be inserted at the end of the range. However, the first iterator will nonetheless point to the equal item even after an insertion operation.

NOTE: The key differences between classes concurrent_unordered_map and concurrent_hash_map are:

• concurrent_unordered_map: permits concurrent traversal and insertion, no visible locking, closely resembles the C++0x unordered_map.
• concurrent_hash_map: permits concurrent erasure, built-in locking.

CAUTION: As with any form of hash table, keys that are equal must have the same hash code, and the ideal hash function distributes keys uniformly across the hash code space.
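The CAUTION above applies to any Hasher and Equality pair. Because concurrent_unordered_map is described as closely resembling std::unordered_map, the following sketch uses std::unordered_map as a stand-in so that it needs no Intel TBB headers; the functors CaseInsensitiveHash and CaseInsensitiveEqual and the helper count_words are hypothetical names introduced for illustration. The point is that both functors normalize a key the same way, so keys that compare equal always hash alike.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Lower-case copy used by both functors so they agree on key identity.
inline std::string to_lower_copy(const std::string& s) {
    std::string r(s);
    for (char& c : r)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return r;
}

// Equality functor: two keys are equal if they match ignoring case.
struct CaseInsensitiveEqual {
    bool operator()(const std::string& a, const std::string& b) const {
        return to_lower_copy(a) == to_lower_copy(b);
    }
};

// Hash functor: hashes the normalized key, so equal keys hash alike.
struct CaseInsensitiveHash {
    std::size_t operator()(const std::string& s) const {
        return std::hash<std::string>()(to_lower_copy(s));
    }
};

typedef std::unordered_map<std::string, int,
                           CaseInsensitiveHash,
                           CaseInsensitiveEqual> WordCount;

// Counts words, treating different capitalizations as the same key.
WordCount count_words(const std::vector<std::string>& words) {
    WordCount table;
    for (const std::string& w : words) {
        // insert returns (iterator, success); for a key that is already
        // present, the iterator refers to the existing item.
        std::pair<WordCount::iterator, bool> r =
            table.insert(WordCount::value_type(w, 0));
        ++r.first->second;
    }
    return table;
}
```

If CaseInsensitiveHash hashed the raw string instead, "TBB" and "tbb" could land in different buckets and be stored as separate items even though the equality functor treats them as the same key.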
Members
In the following synopsis, methods in bold may be concurrently invoked. For example, three different threads can concurrently call methods insert, begin, and size. Their results might be non-deterministic. For example, the result from size might correspond to before or after the insertion.

    template <typename Key,
              typename Element,
              typename Hasher = tbb_hash<Key>,
              typename Equality = std::equal_to<Key>,
              typename Allocator = tbb::tbb_allocator<std::pair<const Key, Element> > >
    class concurrent_unordered_map {
    public:
        // types
        typedef Key key_type;
        typedef std::pair<const Key, Element> value_type;
        typedef Element mapped_type;
        typedef Hasher hasher;
        typedef Equality key_equal;
        typedef Allocator allocator_type;
        typedef typename allocator_type::pointer pointer;
        typedef typename allocator_type::const_pointer const_pointer;
        typedef typename allocator_type::reference reference;
        typedef typename allocator_type::const_reference const_reference;
        typedef implementation-defined size_type;
        typedef implementation-defined difference_type;
        typedef implementation-defined iterator;
        typedef implementation-defined const_iterator;
        typedef implementation-defined local_iterator;
        typedef implementation-defined const_local_iterator;

        // construct/destroy/copy
        explicit concurrent_unordered_map(size_type n = implementation-defined,
            const hasher& hf = hasher(),
            const key_equal& eql = key_equal(),
            const allocator_type& a = allocator_type());
        template <typename InputIterator>
        concurrent_unordered_map(
            InputIterator first, InputIterator last,
            size_type n = implementation-defined,
            const hasher& hf = hasher(),
            const key_equal& eql = key_equal(),
            const allocator_type& a = allocator_type());
        concurrent_unordered_map(const concurrent_unordered_map&);
        concurrent_unordered_map(const Allocator&);
        concurrent_unordered_map(const concurrent_unordered_map&, const Allocator&);
        ~concurrent_unordered_map();

        concurrent_unordered_map& operator=( const concurrent_unordered_map&);

        allocator_type get_allocator() const;

        // size and capacity
        bool empty() const;      // May take linear time!
        size_type size() const;  // May take linear time!
        size_type max_size() const;

        // iterators
        iterator begin();
        const_iterator begin() const;
        iterator end();
        const_iterator end() const;
        const_iterator cbegin() const;
        const_iterator cend() const;

        // modifiers
        std::pair<iterator, bool> insert(const value_type& x);
        iterator insert(const_iterator hint, const value_type& x);
        template<class InputIterator>
        void insert(InputIterator first, InputIterator last);

        iterator unsafe_erase(const_iterator position);
        size_type unsafe_erase(const key_type& k);
        iterator unsafe_erase(const_iterator first, const_iterator last);

        void clear();

        void swap(concurrent_unordered_map&);

        // observers
        hasher hash_function() const;
        key_equal key_eq() const;

        // lookup
        iterator find(const key_type& k);
        const_iterator find(const key_type& k) const;
        size_type count(const key_type& k) const;
        std::pair<iterator, iterator> equal_range(const key_type& k);
        std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const;

        mapped_type& operator[](const key_type& k);
        mapped_type& at( const key_type& k );
        const mapped_type& at(const key_type& k) const;

        // parallel iteration
        typedef implementation defined range_type;
        typedef implementation defined const_range_type;
        range_type range();
        const_range_type range() const;

        // bucket interface - for debugging
        size_type unsafe_bucket_count() const;
        size_type unsafe_max_bucket_count() const;
        size_type unsafe_bucket_size(size_type n);
        size_type unsafe_bucket(const key_type& k) const;
        local_iterator unsafe_begin(size_type n);
        const_local_iterator unsafe_begin(size_type n) const;
        local_iterator unsafe_end(size_type n);
        const_local_iterator unsafe_end(size_type n) const;
        const_local_iterator unsafe_cbegin(size_type n) const;
        const_local_iterator unsafe_cend(size_type n) const;

        // hash policy
        float load_factor() const;
        float max_load_factor() const;
        void max_load_factor(float z);
        void rehash(size_type n);
    };

5.2.1 Construct, Destroy, Copy

5.2.1.1 explicit concurrent_unordered_map (size_type n = implementation-defined, const hasher& hf = hasher(), const
key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

Effects
Construct empty table with n buckets.

5.2.1.2 template <typename InputIterator> concurrent_unordered_map (InputIterator first, InputIterator last, size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())

Effects
Construct table with n buckets initialized with value_type(*i) where i is in the half-open interval [first,last).

5.2.1.3 concurrent_unordered_map(const concurrent_unordered_map& m)

Effects
Construct copy of map m.

5.2.1.4 concurrent_unordered_map(const Allocator& a)

Effects
Construct empty map using allocator a.

5.2.1.5 concurrent_unordered_map(const concurrent_unordered_map& m, const Allocator& a)

Effects
Construct copy of map m using allocator a.

5.2.1.6 ~concurrent_unordered_map()

Effects
Destroy the map.

5.2.1.7 concurrent_unordered_map& operator=(const concurrent_unordered_map& m);

Effects
Set *this to a copy of map m.

5.2.1.8 allocator_type get_allocator() const;

Returns
Copy of the allocator associated with *this.

5.2.2 Size and capacity

5.2.2.1 bool empty() const

Returns
size()==0.

5.2.2.2 size_type size() const

Returns
Number of items in *this.

CAUTION: Though the current implementation takes time O(1), possible future implementations might take time O(P), where P is the number of hardware threads.

5.2.2.3 size_type max_size() const

Returns
Upper bound on number of items that *this can hold.

CAUTION: The upper bound may be much higher than what the container can actually hold.

5.2.3 Iterators

Template class concurrent_unordered_map supports forward iterators; that is, iterators that can advance only forwards across a table. Reverse iterators are not supported. Concurrent operations (count, find, insert) do not invalidate any existing iterators that point into the table. Note that an iterator obtained via begin() will no longer point to the first item if insert inserts an item before it.
Methods cbegin and cend follow C++0x conventions. They return const_iterator even if the object is non-const.

5.2.3.1 iterator begin()

Returns
iterator pointing to first item in the map.

5.2.3.2 const_iterator begin() const

Returns
const_iterator pointing to first item in the map.

5.2.3.3 iterator end()

Returns
iterator pointing to immediately past last item in the map.

5.2.3.4 const_iterator end() const

Returns
const_iterator pointing to immediately past last item in the map.

5.2.3.5 const_iterator cbegin() const

Returns
const_iterator pointing to first item in the map.

5.2.3.6 const_iterator cend() const

Returns
const_iterator pointing to immediately after the last item in the map.

5.2.4 Modifiers

5.2.4.1 std::pair<iterator, bool> insert(const value_type& x)

Effects
Constructs copy of x and attempts to insert it into the map. Destroys the copy if the attempt fails because there was already an item with the same key.

Returns
std::pair(iterator,success). The value iterator points to an item in the map with a matching key. The value of success is true if the item was inserted; false otherwise.

5.2.4.2 iterator insert(const_iterator hint, const value_type& x)

Effects
Same as insert(x).

NOTE: The current implementation ignores the hint argument. Other implementations might not ignore it. It exists for similarity with the C++0x class unordered_map. It hints to the implementation about where to start searching. Typically it should point to an item adjacent to where the item will be inserted.

Returns
Iterator pointing to inserted item, or item already in the map with the same key.

5.2.4.3 template<class InputIterator> void insert(InputIterator first, InputIterator last)

Effects
Does insert(*i) where i is in the half-open interval [first,last).

5.2.4.4 iterator unsafe_erase(const_iterator position)

Effects
Remove item pointed to by position from the map.
Returns Iterator pointing to item that was immediately after the erased item, or end() if erased item was the last item in the map.
5.2.4.5 size_type unsafe_erase(const key_type& k)
Effects Remove item with key k if such an item exists.
Returns 1 if an item was removed; 0 otherwise.
5.2.4.6 iterator unsafe_erase(const_iterator first, const_iterator last)
Effects Remove *i where i is in the half-open interval [first,last).
Returns last
5.2.4.7 void clear()
Effects Remove all items from the map.
5.2.4.8 void swap(concurrent_unordered_map& m)
Effects Swap contents of *this and m.

5.2.5 Observers
5.2.5.1 hasher hash_function() const
Returns Hashing functor associated with the map.
5.2.5.2 key_equal key_eq() const
Returns Key equivalence functor associated with the map.

5.2.6 Lookup
5.2.6.1 iterator find(const key_type& k)
Returns iterator pointing to item with key equivalent to k, or end() if no such item exists.
5.2.6.2 const_iterator find(const key_type& k) const
Returns const_iterator pointing to item with key equivalent to k, or end() if no such item exists.
5.2.6.3 size_type count(const key_type& k) const
Returns Number of items with keys equivalent to k.
5.2.6.4 std::pair<iterator, iterator> equal_range(const key_type& k)
Returns Range containing all keys in the map that are equivalent to k.
5.2.6.5 std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const
Returns Range containing all keys in the map that are equivalent to k.
5.2.6.6 mapped_type& operator[](const key_type& k)
Effects Inserts a new item if item with key equivalent to k is not already present.
Returns Reference to x.second, where x is the item in map with key equivalent to k.
5.2.6.7 mapped_type& at( const key_type& k )
Effects Throws exception if item with key equivalent to k is not already present.
Returns Reference to x.second, where x is the item in map with key equivalent to k.
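The difference between operator[] (inserts when the key is missing) and at() (throws when the key is missing) can be sketched as follows. This is only an illustration under an assumption: std::unordered_map is used as a stand-in because it offers the same two lookup methods, and the exception type caught here (std::out_of_range) is what the standard container throws; this section does not state which exception type concurrent_unordered_map uses. The names Table and at_throws are hypothetical helpers.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Stand-in container (assumption): same operator[] / at() interface as described above.
typedef std::unordered_map<std::string, int> Table;

// Returns true if at(k) throws, that is, if no item with key k is present.
// at() never inserts, so the table is left unchanged either way.
bool at_throws(const Table& t, const std::string& k) {
    try {
        t.at(k);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Counts one occurrence of word. operator[] inserts a value-initialized
// mapped value (0 for int) when the key is absent, then returns a reference to it.
void count_word(Table& t, const std::string& word) {
    t[word] += 1;
}
```

Calling count_word on an empty table grows it by one item, whereas at_throws on the same missing key would report true and leave the table empty.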
5.2.6.8 const mapped_type& at(const key_type& k) const
Effects Throws exception if item with key equivalent to k is not already present.
Returns Const reference to x.second, where x is the item in map with key equivalent to k.

5.2.7 Parallel Iteration
Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.
5.2.7.1 const_range_type range() const
Returns const_range_type object representing all keys in the table.
5.2.7.2 range_type range()
Returns range_type object representing all keys in the table.

5.2.8 Bucket Interface
The bucket interface is intended for debugging. It is not concurrency safe. The mapping of keys to buckets is implementation specific. The interface is similar to the bucket interface for the C++0x class unordered_map, except that the prefix unsafe_ has been added as a reminder that the methods are unsafe to use during concurrent insertion.
Buckets are numbered from 0 to unsafe_bucket_count()-1. To iterate over a bucket use a local_iterator or const_local_iterator.
5.2.8.1 size_type unsafe_bucket_count() const
Returns Number of buckets.
5.2.8.2 size_type unsafe_max_bucket_count() const
Returns Upper bound on possible number of buckets.
5.2.8.3 size_type unsafe_bucket_size(size_type n)
Returns Number of items in bucket n.
5.2.8.4 size_type unsafe_bucket(const key_type& k) const
Returns Index of bucket where item with key k would be placed.
5.2.8.5 local_iterator unsafe_begin(size_type n)
Returns local_iterator pointing to first item in bucket n.
5.2.8.6 const_local_iterator unsafe_begin(size_type n) const
Returns const_local_iterator pointing to first item in bucket n.
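As a rough illustration of walking buckets 0 to bucket_count()-1 with local iterators, the sketch below uses std::unordered_map as a stand-in (an assumption made so the example needs only the standard library). Its bucket_count, bucket, cbegin(n) and cend(n) play the roles of the unsafe_-prefixed methods in this section. The helper names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Stand-in container (assumption): its bucket interface mirrors the one
// documented here, minus the unsafe_ prefix.
typedef std::unordered_map<std::string, int> Table;

// Visits every bucket in index order and counts the items found.
// The total should equal size() when no other thread modifies the table.
std::size_t count_items_by_bucket(const Table& t) {
    std::size_t total = 0;
    for (std::size_t b = 0; b < t.bucket_count(); ++b) {
        for (Table::const_local_iterator it = t.cbegin(b); it != t.cend(b); ++it) {
            ++total;
        }
    }
    return total;
}

// Checks that the bucket index reported for key k is the bucket that holds k.
// Only call this on a non-empty table.
bool key_is_in_reported_bucket(const Table& t, const std::string& k) {
    const std::size_t b = t.bucket(k);
    for (Table::const_local_iterator it = t.cbegin(b); it != t.cend(b); ++it) {
        if (it->first == k) return true;
    }
    return false;
}
```

Such checks are debugging aids only; as noted above, the bucket methods are not safe to call while other threads insert.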
5.2.8.7 local_iterator unsafe_end(size_type n)
Returns local_iterator pointing immediately past the last item in bucket n.
5.2.8.8 const_local_iterator unsafe_end(size_type n) const
Returns const_local_iterator pointing immediately past the last item in bucket n.
5.2.8.9 const_local_iterator unsafe_cbegin(size_type n) const
Returns const_local_iterator pointing to first item in bucket n.
5.2.8.10 const_local_iterator unsafe_cend(size_type n) const
Returns const_local_iterator pointing immediately past the last item in bucket n.

5.2.9 Hash policy
5.2.9.1 float load_factor() const
Returns Average number of elements per bucket.
5.2.9.2 float max_load_factor() const
Returns Maximum size of a bucket. If insertion of an item causes a bucket to be bigger, the implementation may repartition or increase the number of buckets.
5.2.9.3 void max_load_factor(float z)
Effects Set maximum size for a bucket to z.
5.2.9.4 void rehash(size_type n)
Requirements n must be a power of two.
Effects No effect if current number of buckets is at least n. Otherwise increases number of buckets to n.

5.3 concurrent_unordered_set Template Class
Summary
Template class for a set container that supports concurrent insertion and traversal.
Syntax
    template <typename Key, typename Hasher = tbb_hash<Key>, typename Equality = std::equal_to<Key>, typename Allocator = tbb::tbb_allocator<Key> > class concurrent_unordered_set;
Header
    #include "tbb/concurrent_unordered_set.h"
Description
A concurrent_unordered_set supports concurrent insertion and traversal, but not concurrent erasure. The interface has no visible locking. It may hold locks internally, but never while calling user-defined code. It has semantics similar to the C++0x std::unordered_set except as follows:
• Methods requiring C++0x language features (such as rvalue references and std::initializer_list) are currently omitted.
• The erase methods are prefixed with unsafe_, to indicate that they are not concurrency safe.
• Bucket methods are prefixed with unsafe_ as a reminder that they are not concurrency safe with respect to insertion.
• The insert methods may create a temporary pair that is destroyed if another thread inserts the same key concurrently.
• Like std::list, insertion of new items does not invalidate any iterators, nor change the order of items already in the set. Insertion and traversal may be concurrent.
• The iterator types iterator and const_iterator are of the forward iterator category.
• Insertion does not invalidate or update the iterators returned by equal_range, so insertion may cause non-equal items to be inserted at the end of the range. However, the first iterator will nonetheless point to the equal item even after an insertion operation.
CAUTION: As with any form of hash table, keys that are equal must have the same hash code, and the ideal hash function distributes keys uniformly across the hash code space.
Members
In the following synopsis, methods in bold may be concurrently invoked. For example, three different threads can concurrently call methods insert, begin, and size. Their results might be non-deterministic. For example, the result from size might correspond to before or after the insertion.
    template <typename Key, typename Hasher = tbb_hash<Key>,
              typename Equality = std::equal_to<Key>,
              typename Allocator = tbb::tbb_allocator<Key> >
    class concurrent_unordered_set {
    public:
        // types
        typedef Key key_type;
        typedef Key value_type;
        typedef Key mapped_type;
        typedef Hasher hasher;
        typedef Equality key_equal;
        typedef Allocator allocator_type;
        typedef typename allocator_type::pointer pointer;
        typedef typename allocator_type::const_pointer const_pointer;
        typedef typename allocator_type::reference reference;
        typedef typename allocator_type::const_reference const_reference;
        typedef implementation-defined size_type;
        typedef implementation-defined difference_type;
        typedef implementation-defined iterator;
        typedef implementation-defined const_iterator;
        typedef implementation-defined local_iterator;
        typedef implementation-defined const_local_iterator;

        // construct/destroy/copy
        explicit concurrent_unordered_set(size_type n = implementation-defined,
            const hasher& hf = hasher(),
            const key_equal& eql = key_equal(),
            const allocator_type& a = allocator_type());
        template <typename InputIterator>
        concurrent_unordered_set(
            InputIterator first, InputIterator last,
            size_type n = implementation-defined,
            const hasher& hf = hasher(),
            const key_equal& eql = key_equal(),
            const allocator_type& a = allocator_type());
        concurrent_unordered_set(const concurrent_unordered_set&);
        concurrent_unordered_set(const Allocator&);
        concurrent_unordered_set(const concurrent_unordered_set&, const Allocator&);
        ~concurrent_unordered_set();
        concurrent_unordered_set& operator=(const concurrent_unordered_set&);
        allocator_type get_allocator() const;

        // size and capacity
        bool empty() const;     // May take linear time!
        size_type size() const; // May take linear time!
        size_type max_size() const;

        // iterators
        iterator begin();
        const_iterator begin() const;
        iterator end();
        const_iterator end() const;
        const_iterator cbegin() const;
        const_iterator cend() const;

        // modifiers
        std::pair<iterator, bool> insert(const value_type& x);
        iterator insert(const_iterator hint, const value_type& x);
        template <typename InputIterator>
        void insert(InputIterator first, InputIterator last);
        iterator unsafe_erase(const_iterator position);
        size_type unsafe_erase(const key_type& k);
        iterator unsafe_erase(const_iterator first, const_iterator last);
        void clear();
        void swap(concurrent_unordered_set&);

        // observers
        hasher hash_function() const;
        key_equal key_eq() const;

        // lookup
        iterator find(const key_type& k);
        const_iterator find(const key_type& k) const;
        size_type count(const key_type& k) const;
        std::pair<iterator, iterator> equal_range(const key_type& k);
        std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const;

        // parallel iteration
        typedef implementation-defined range_type;
        typedef implementation-defined const_range_type;
        range_type range();
        const_range_type range() const;

        // bucket interface – for debugging
        size_type unsafe_bucket_count() const;
        size_type unsafe_max_bucket_count() const;
        size_type unsafe_bucket_size(size_type n);
        size_type unsafe_bucket(const key_type& k) const;
        local_iterator unsafe_begin(size_type n);
        const_local_iterator unsafe_begin(size_type n) const;
        local_iterator unsafe_end(size_type n);
        const_local_iterator unsafe_end(size_type n) const;
        const_local_iterator unsafe_cbegin(size_type n) const;
        const_local_iterator unsafe_cend(size_type n) const;

        // hash policy
        float load_factor() const;
        float max_load_factor() const;
        void max_load_factor(float z);
        void rehash(size_type n);
    };

5.3.1 Construct, Destroy, Copy
5.3.1.1 explicit concurrent_unordered_set (size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
Effects Construct empty set with n buckets.
5.3.1.2 template<typename InputIterator> concurrent_unordered_set (InputIterator first, InputIterator last, size_type n = implementation-defined, const hasher& hf = hasher(), const key_equal& eql = key_equal(), const allocator_type& a = allocator_type())
Effects Construct set with n buckets initialized with value_type(*i) where i is in the half-open interval [first,last).
5.3.1.3 concurrent_unordered_set(const concurrent_unordered_set& m)
Effects Construct copy of set m.
5.3.1.4 concurrent_unordered_set(const Allocator& a)
Effects Construct empty set using allocator a.
5.3.1.5 concurrent_unordered_set(const concurrent_unordered_set& m, const Allocator& a)
Effects Construct copy of set m using allocator a.
5.3.1.6 ~concurrent_unordered_set()
Effects Destroy the set.
5.3.1.7 concurrent_unordered_set& operator=(const concurrent_unordered_set& m);
Effects Set *this to a copy of set m.
5.3.1.8 allocator_type get_allocator() const;
Returns Copy of the allocator associated with *this.

5.3.2 Size and capacity
5.3.2.1 bool empty() const
Returns size()==0.
5.3.2.2 size_type size() const
Returns Number of items in *this.
CAUTION: Though the current implementation takes time O(1), possible future implementations might take time O(P), where P is the number of hardware threads.
5.3.2.3 size_type max_size() const
Returns Upper bound on number of items that *this can hold.
CAUTION: The upper bound may be much higher than what the container can actually hold.

5.3.3 Iterators
Template class concurrent_unordered_set supports forward iterators; that is, iterators that can advance only forwards across a set. Reverse iterators are not supported. Concurrent operations (count, find, insert) do not invalidate any existing iterators that point into the set. Note that an iterator obtained via begin() will no longer point to the first item if insert inserts an item before it.
Methods cbegin and cend follow C++0x conventions. They return const_iterator even if the object is non-const.
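The two iterator properties described above (forward-only traversal, and cbegin/cend returning const_iterator even on a non-const object) can be checked at compile time. The sketch below assumes std::unordered_set as a stand-in, since it follows the same C++0x conventions; the typedef Set and the function sum_forward are hypothetical names.

```cpp
#include <iterator>
#include <type_traits>
#include <unordered_set>
#include <utility>

// Stand-in container (assumption): follows the same cbegin/cend conventions.
typedef std::unordered_set<int> Set;

// cbegin() on a NON-const object still yields const_iterator.
static_assert(
    std::is_same<decltype(std::declval<Set&>().cbegin()), Set::const_iterator>::value,
    "cbegin() must return const_iterator even for a non-const set");

// The iterator category is at least forward.
static_assert(
    std::is_base_of<std::forward_iterator_tag,
                    std::iterator_traits<Set::iterator>::iterator_category>::value,
    "set iterators must be at least forward iterators");

// Forward traversal using only ++ and !=, which is all a forward iterator guarantees.
int sum_forward(const Set& s) {
    int sum = 0;
    for (Set::const_iterator it = s.cbegin(); it != s.cend(); ++it) {
        sum += *it;
    }
    return sum;
}
```

Code written against only these operations carries over to any container whose iterators are forward iterators, which is the guarantee this section gives.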
5.3.3.1 iterator begin()
Returns iterator pointing to first item in the set.
5.3.3.2 const_iterator begin() const
Returns const_iterator pointing to first item in the set.
5.3.3.3 iterator end()
Returns iterator pointing immediately past the last item in the set.
5.3.3.4 const_iterator end() const
Returns const_iterator pointing immediately past the last item in the set.
5.3.3.5 const_iterator cbegin() const
Returns const_iterator pointing to first item in the set.
5.3.3.6 const_iterator cend() const
Returns const_iterator pointing immediately past the last item in the set.

5.3.4 Modifiers
5.3.4.1 std::pair<iterator, bool> insert(const value_type& x)
Effects Constructs copy of x and attempts to insert it into the set. Destroys the copy if the attempt fails because there was already an item with the same key.
Returns std::pair(iterator,success). The value iterator points to an item in the set with a matching key. The value of success is true if the item was inserted; false otherwise.
5.3.4.2 iterator insert(const_iterator hint, const value_type& x)
Effects Same as insert(x).
NOTE: The current implementation ignores the hint argument. Other implementations might not ignore it. It exists for similarity with the C++0x class unordered_set. It hints to the implementation about where to start searching. Typically it should point to an item adjacent to where the item will be inserted.
Returns Iterator pointing to inserted item, or item already in the set with the same key.
5.3.4.3 template<typename InputIterator> void insert(InputIterator first, InputIterator last)
Effects Does insert(*i) where i is in the half-open interval [first,last).
5.3.4.4 iterator unsafe_erase(const_iterator position)
Effects Remove item pointed to by position from the set.
Returns Iterator pointing to item that was immediately after the erased item, or end() if erased item was the last item in the set.
5.3.4.5 size_type unsafe_erase(const key_type& k)
Effects Remove item with key k if such an item exists.
Returns 1 if an item was removed; 0 otherwise.
5.3.4.6 iterator unsafe_erase(const_iterator first, const_iterator last)
Effects Remove *i where i is in the half-open interval [first,last).
Returns last
5.3.4.7 void clear()
Effects Remove all items from the set.
5.3.4.8 void swap(concurrent_unordered_set& m)
Effects Swap contents of *this and m.

5.3.5 Observers
5.3.5.1 hasher hash_function() const
Returns Hashing functor associated with the set.
5.3.5.2 key_equal key_eq() const
Returns Key equivalence functor associated with the set.

5.3.6 Lookup
5.3.6.1 iterator find(const key_type& k)
Returns iterator pointing to item with key equivalent to k, or end() if no such item exists.
5.3.6.2 const_iterator find(const key_type& k) const
Returns const_iterator pointing to item with key equivalent to k, or end() if no such item exists.
5.3.6.3 size_type count(const key_type& k) const
Returns Number of items with keys equivalent to k.
5.3.6.4 std::pair<iterator, iterator> equal_range(const key_type& k)
Returns Range containing all keys in the set that are equivalent to k.
5.3.6.5 std::pair<const_iterator, const_iterator> equal_range(const key_type& k) const
Returns Range containing all keys in the set that are equivalent to k.

5.3.7 Parallel Iteration
Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.
5.3.7.1 const_range_type range() const
Returns const_range_type object representing all keys in the set.
5.3.7.2 range_type range()
Returns range_type object representing all keys in the set.

5.3.8 Bucket Interface
The bucket interface is intended for debugging. It is not concurrency safe. The mapping of keys to buckets is implementation specific.
The interface is similar to the bucket interface for the C++0x class unordered_set, except that the prefix unsafe_ has been added as a reminder that the methods are unsafe to use during concurrent insertion.
Buckets are numbered from 0 to unsafe_bucket_count()-1. To iterate over a bucket use a local_iterator or const_local_iterator.
5.3.8.1 size_type unsafe_bucket_count() const
Returns Number of buckets.
5.3.8.2 size_type unsafe_max_bucket_count() const
Returns Upper bound on possible number of buckets.
5.3.8.3 size_type unsafe_bucket_size(size_type n)
Returns Number of items in bucket n.
5.3.8.4 size_type unsafe_bucket(const key_type& k) const
Returns Index of bucket where item with key k would be placed.
5.3.8.5 local_iterator unsafe_begin(size_type n)
Returns local_iterator pointing to first item in bucket n.
5.3.8.6 const_local_iterator unsafe_begin(size_type n) const
Returns const_local_iterator pointing to first item in bucket n.
5.3.8.7 local_iterator unsafe_end(size_type n)
Returns local_iterator pointing immediately past the last item in bucket n.
5.3.8.8 const_local_iterator unsafe_end(size_type n) const
Returns const_local_iterator pointing immediately past the last item in bucket n.
5.3.8.9 const_local_iterator unsafe_cbegin(size_type n) const
Returns const_local_iterator pointing to first item in bucket n.
5.3.8.10 const_local_iterator unsafe_cend(size_type n) const
Returns const_local_iterator pointing immediately past the last item in bucket n.

5.3.9 Hash policy
5.3.9.1 float load_factor() const
Returns Average number of elements per bucket.
5.3.9.2 float max_load_factor() const
Returns Maximum size of a bucket. If insertion of an item causes a bucket to be bigger, the implementation may repartition or increase the number of buckets.
5.3.9.3 void max_load_factor(float z)
Effects Set maximum size for a bucket to z.
5.3.9.4 void rehash(size_type n)
Requirements n must be a power of two.
Effects No effect if current number of buckets is at least n. Otherwise increases number of buckets to n.

5.4 concurrent_hash_map Template Class
Summary
Template class for associative container with concurrent access.
Syntax
    template<typename Key, typename T, typename HashCompare=tbb_hash_compare<Key>, typename A=tbb_allocator<std::pair<Key, T> > > class concurrent_hash_map;
Header
    #include "tbb/concurrent_hash_map.h"
Description
A concurrent_hash_map maps keys to values in a way that permits multiple threads to concurrently access values. The keys are unordered. There is at most one element in a concurrent_hash_map for each key. The key may have other elements in flight but not in the map as described in Section 5.4.3. The interface resembles typical STL associative containers, but with some differences critical to supporting concurrent access. It meets the Container Requirements of the ISO C++ standard.
Types Key and T must model the CopyConstructible concept (2.2.3).
Type HashCompare specifies how keys are hashed and compared for equality. It must model the HashCompare concept in Table 19.

Table 19: HashCompare Concept
Pseudo-Signature                                                   Semantics
HashCompare::HashCompare( const HashCompare& )                     Copy constructor.
HashCompare::~HashCompare ()                                       Destructor.
bool HashCompare::equal( const Key& j, const Key& k ) const        True if keys are equal.
size_t HashCompare::hash( const Key& k ) const                     Hashcode for key.

CAUTION: As for most hash tables, if two keys are equal, they must hash to the same hash code. That is, for a given HashCompare h and any two keys j and k, the following assertion must hold: “!h.equal(j,k) || h.hash(j)==h.hash(k)”. The importance of this property is the reason that concurrent_hash_map makes key equality and hashing function travel together in a single object instead of being separate objects. The hash code of a key must not change while the hash table is non-empty.
CAUTION: Good performance depends on having good pseudo-randomness in the low-order bits of the hash code.
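To make the HashCompare concept in Table 19 and its equal/hash invariant concrete, here is a minimal sketch of one possible model for std::string keys compared without regard to case. The struct name, the helper invariant_holds, and the simple multiply-by-31 hash are illustrative assumptions, not part of the library; the copy constructor and destructor required by the concept are the implicitly generated ones.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical model of the HashCompare concept: case-insensitive string keys.
struct CaseInsensitiveHashCompare {
    static char lower(char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    // True if keys are equal, ignoring case.
    bool equal(const std::string& j, const std::string& k) const {
        if (j.size() != k.size()) return false;
        for (std::size_t i = 0; i < j.size(); ++i) {
            if (lower(j[i]) != lower(k[i])) return false;
        }
        return true;
    }

    // Hash code computed from the lower-cased characters, so that keys
    // considered equal by equal() always receive the same hash code.
    std::size_t hash(const std::string& k) const {
        std::size_t h = 0;
        for (std::size_t i = 0; i < k.size(); ++i) {
            h = h * 31 + static_cast<unsigned char>(lower(k[i]));
        }
        return h;
    }
};

// The assertion from the CAUTION above: !h.equal(j,k) || h.hash(j)==h.hash(k).
bool invariant_holds(const CaseInsensitiveHashCompare& h,
                     const std::string& j, const std::string& k) {
    return !h.equal(j, k) || h.hash(j) == h.hash(k);
}
```

Hashing the raw (case-sensitive) characters while comparing case-insensitively would break the invariant, which is the kind of mismatch that bundling both functions in one object is meant to discourage.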
Example
When keys are pointers, simply casting the pointer to a hash code may cause poor performance because the low-order bits of the hash code will always be zero if the pointer points to a type with alignment restrictions. A way to remove this bias is to divide the cast pointer by the size of the type, as shown in the return statement below.
    size_t MyHashCompare::hash( Key* key ) const {
        return reinterpret_cast<size_t>(key)/sizeof(Key);
    }
Members
    namespace tbb {
        template<typename Key, typename T, typename HashCompare,
                 typename Alloc=tbb_allocator<std::pair<Key,T> > >
        class concurrent_hash_map {
        public:
            // types
            typedef Key key_type;
            typedef T mapped_type;
            typedef std::pair<const Key,T> value_type;
            typedef size_t size_type;
            typedef ptrdiff_t difference_type;
            typedef value_type* pointer;
            typedef const value_type* const_pointer;
            typedef value_type& reference;
            typedef Alloc allocator_type;

            // whole-table operations
            concurrent_hash_map( const allocator_type& a=allocator_type() );
            concurrent_hash_map( size_type n, const allocator_type &a = allocator_type() );
            concurrent_hash_map( const concurrent_hash_map&, const allocator_type& a=allocator_type() );
            template<typename InputIterator>
            concurrent_hash_map( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() );
            ~concurrent_hash_map();
            concurrent_hash_map& operator=( const concurrent_hash_map& );
            void rehash( size_type n=0 );
            void clear();
            allocator_type get_allocator() const;

            // concurrent access
            class const_accessor;
            class accessor;

            // concurrent operations on a table
            bool find( const_accessor& result, const Key& key ) const;
            bool find( accessor& result, const Key& key );
            bool insert( const_accessor& result, const Key& key );
            bool insert( accessor& result, const Key& key );
            bool insert( const_accessor& result, const value_type& value );
            bool insert( accessor& result, const value_type& value );
            bool insert( const value_type& value );
            template<typename I> void insert( I first, I last );
            bool erase( const Key& key );
            bool erase( const_accessor& item_accessor );
            bool erase( accessor& item_accessor );

            // parallel iteration
            typedef implementation-defined range_type;
            typedef implementation-defined const_range_type;
            range_type range( size_t grainsize=1 );
            const_range_type range( size_t grainsize=1 ) const;

            // capacity
            size_type size() const;
            bool empty() const;
            size_type max_size() const;
            size_type bucket_count() const;

            // iterators
            typedef implementation-defined iterator;
            typedef implementation-defined const_iterator;
            iterator begin();
            iterator end();
            const_iterator begin() const;
            const_iterator end() const;
            std::pair<iterator, iterator> equal_range( const Key& key );
            std::pair<const_iterator, const_iterator> equal_range( const Key& key ) const;
        };

        template<typename Key, typename T, typename HashCompare, typename A1, typename A2>
        bool operator==( const concurrent_hash_map<Key,T,HashCompare,A1> &a,
                         const concurrent_hash_map<Key,T,HashCompare,A2> &b );

        template<typename Key, typename T, typename HashCompare, typename A1, typename A2>
        bool operator!=( const concurrent_hash_map<Key,T,HashCompare,A1> &a,
                         const concurrent_hash_map<Key,T,HashCompare,A2> &b );

        template<typename Key, typename T, typename HashCompare, typename A>
        void swap( concurrent_hash_map<Key,T,HashCompare,A>& a,
                   concurrent_hash_map<Key,T,HashCompare,A>& b );
    }

Exception Safety
The following functions must not throw exceptions:
• The hash function
• The destructors for types Key and T.
The following hold true:
• If an exception happens during an insert operation, the operation has no effect.
• If an exception happens during an assignment operation, the container may be in a state where only some of the items were assigned, and methods size() and empty() may return invalid answers.

5.4.1 Whole Table Operations
These operations affect an entire table. Do not concurrently invoke them on the same table.
5.4.1.1 concurrent_hash_map( const allocator_type& a = allocator_type() )
Effects Constructs empty table.
5.4.1.2 concurrent_hash_map( size_type n, const allocator_type& a = allocator_type() )
Effects Construct empty table with preallocated buckets for at least n items.
NOTE: In general, thread contention for buckets is inversely related to the number of buckets. If memory consumption is not an issue and P threads will be accessing the concurrent_hash_map, set n=4P.
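The n=4P rule of thumb from the NOTE above amounts to simple arithmetic. The sketch below shows one way to derive P from the hardware and compute n; the function names are hypothetical, and taking P from std::thread::hardware_concurrency() is an assumption (an application may know its actual thread count better).

```cpp
#include <cstddef>
#include <thread>

// Number of threads P to plan for. hardware_concurrency() may return 0
// when the value cannot be determined, so fall back to 1 in that case.
std::size_t hardware_threads() {
    const unsigned p = std::thread::hardware_concurrency();
    return p == 0 ? 1 : static_cast<std::size_t>(p);
}

// Suggested initial size n for P threads, following n = 4P.
std::size_t suggested_initial_size(std::size_t threads) {
    return 4 * threads;
}
```

For example, with P = 8 threads the suggested value is n = 32, which would be passed as the first argument of the constructor in 5.4.1.2.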
5.4.1.3 concurrent_hash_map( const concurrent_hash_map& table, const allocator_type& a = allocator_type() )
Effects Copies a table. The table being copied may have const operations running on it concurrently.
5.4.1.4 template<typename InputIterator> concurrent_hash_map( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )
Effects Constructs table containing copies of elements in the half-open iterator interval [first,last).
5.4.1.5 ~concurrent_hash_map()
Effects Invokes clear(). This method is not safe to execute concurrently with other methods on the same concurrent_hash_map.
5.4.1.6 concurrent_hash_map& operator= ( const concurrent_hash_map& source )
Effects If source and destination (this) table are distinct, clears the destination table and copies all key-value pairs from the source table to the destination table. Otherwise, does nothing.
Returns Reference to the destination table.
5.4.1.7 void swap( concurrent_hash_map& table )
Effects Swaps contents and allocators of this and table.
5.4.1.8 void rehash( size_type n=0 )
Effects Internally, the table is partitioned into buckets. Method rehash reorganizes these internal buckets in a way that may improve performance of future lookups. Raises number of internal buckets to n if n>0 and n exceeds the current number of buckets.
CAUTION: The current implementation never reduces the number of buckets. A future implementation might reduce the number of buckets if n is less than the current number of buckets.
NOTE: The ratio of items to buckets affects time and space usage by a table. A high ratio saves space at the expense of time. A low ratio does the opposite. The default ratio is 0.5 to 1 items per bucket on average.
5.4.1.9 void clear()
Effects Erases all key-value pairs from the table. Does not hash or compare any keys.
If TBB_USE_PERFORMANCE_WARNINGS is nonzero, issues a performance warning if the randomness of the hashing is poor enough to significantly impact performance.
5.4.1.10 allocator_type get_allocator() const
Returns Copy of allocator used to construct table.

5.4.2 Concurrent Access
Member classes const_accessor and accessor are called accessors. Accessors allow multiple threads to concurrently access pairs in a shared concurrent_hash_map. An accessor acts as a smart pointer to a pair in a concurrent_hash_map. It holds an implicit lock on a pair until the instance is destroyed or method release is called on the accessor.
Classes const_accessor and accessor differ in the kind of access that they permit.

Table 20: Differences Between const_accessor and accessor
Class            value_type                       Implied Lock on pair
const_accessor   const std::pair<const Key,T>     Reader lock – permits shared access with other readers.
accessor         std::pair<const Key,T>           Writer lock – permits exclusive access by a thread. Blocks access by other threads.

Accessors cannot be assigned or copy-constructed, because allowing such would greatly complicate the locking semantics.

5.4.2.1 const_accessor
Summary
Provides read-only access to a pair in a concurrent_hash_map.
Syntax
    template<typename Key, typename T, typename HashCompare, typename A>
    class concurrent_hash_map<Key,T,HashCompare,A>::const_accessor;
Header
    #include "tbb/concurrent_hash_map.h"
Description
A const_accessor permits read-only access to a key-value pair in a concurrent_hash_map.
Members
    namespace tbb {
        template<typename Key, typename T, typename HashCompare, typename A>
        class concurrent_hash_map<Key,T,HashCompare,A>::const_accessor {
        public:
            // types
            typedef const std::pair<const Key,T> value_type;

            // construction and destruction
            const_accessor();
            ~const_accessor();

            // inspection
            bool empty() const;
            const value_type& operator*() const;
            const value_type* operator->() const;

            // early release
            void release();
        };
    }
5.4.2.1.1 bool empty() const
Returns True if instance points to nothing; false if instance points to a key-value pair.
5.4.2.1.2 void release()
Effects If !empty(), releases the implied lock on the pair, and sets instance to point to nothing. Otherwise does nothing.
5.4.2.1.3 const value_type& operator*() const
Effects Raises assertion failure if empty() and TBB_USE_ASSERT (3.2.1) is defined as nonzero.
Returns Const reference to key-value pair.
5.4.2.1.4 const value_type* operator->() const
Returns &operator*()
5.4.2.1.5 const_accessor()
Effects Constructs const_accessor that points to nothing.
5.4.2.1.6 ~const_accessor()
Effects If pointing to key-value pair, releases the implied lock on the pair.

5.4.2.2 accessor
Summary
Class that provides read and write access to a pair in a concurrent_hash_map.
Syntax
    template<typename Key, typename T, typename HashCompare, typename A>
    class concurrent_hash_map<Key,T,HashCompare,A>::accessor;
Header
    #include "tbb/concurrent_hash_map.h"
Description
An accessor permits read and write access to a key-value pair in a concurrent_hash_map. It is derived from a const_accessor, and thus can be implicitly cast to a const_accessor.
Members
    namespace tbb {
        template<typename Key, typename T, typename HashCompare, typename A>
        class concurrent_hash_map<Key,T,HashCompare,A>::accessor:
            concurrent_hash_map<Key,T,HashCompare,A>::const_accessor {
        public:
            typedef std::pair<const Key,T> value_type;
            value_type& operator*() const;
            value_type* operator->() const;
        };
    }
5.4.2.2.1 value_type& operator*() const
Effects Raises assertion failure if empty() and TBB_USE_ASSERT (3.2.1) is defined as nonzero.
Returns Reference to key-value pair.
5.4.2.2.2 value_type* operator->() const
Returns &operator*()

5.4.3 Concurrent Operations
The operations count, find, insert, and erase are the only operations that may be concurrently invoked on the same concurrent_hash_map. These operations search the table for a key-value pair that matches a given key. The find and insert methods each have two variants. One takes a const_accessor argument and provides read-only access to the desired key-value pair. The other takes an accessor argument and provides write access.
Additionally, insert has a variant without any accessor.
CAUTION: The concurrent operations (count, find, insert, and erase) invalidate any iterators pointing into the affected instance, even the operations that are const-qualified. It is unsafe to use these operations concurrently with any other operation. An exception to this rule is that count and find do not invalidate iterators if no insertions or erasures have occurred after the most recent call to method rehash.
TIP: In serial code, the equal_range method should be used instead of the find method for lookup, since equal_range is faster and does not invalidate iterators.
TIP: If the non-const variant succeeds in finding the key, the consequent write access blocks any other thread from accessing the key until the accessor object is destroyed. Where possible, use the const variant to improve concurrency.
Each map operation in this section returns true if the operation succeeds, false otherwise.
CAUTION: Though there can be at most one occurrence of a given key in the map, there may be other key-value pairs in flight with the same key. These arise from the semantics of the insert and erase methods. The insert methods can create and destroy a temporary key-value pair that is not inserted into a map. The erase methods remove a key-value pair from the map before destroying it, thus permitting another thread to construct a similar key before the old one is destroyed.
TIP: To guarantee that only one instance of a resource exists simultaneously for a given key, use the following technique:
• To construct the resource: Obtain an accessor to the key in the map before constructing the resource.
• To destroy the resource: Obtain an accessor to the key, destroy the resource, and then erase the key using the accessor.
Below is a sketch of how this can be done.
    extern tbb::concurrent_hash_map<Key,Resource,HashCompare> Map;

    void ConstructResource( Key key ) {
        accessor acc;
        if( Map.insert(acc,key) ) {
            // Current thread inserted key and has exclusive access.
            ...construct the resource here...
        }
        // Implicit destruction of acc releases lock
    }

    void DestroyResource( Key key ) {
        accessor acc;
        if( Map.find(acc,key) ) {
            // Current thread found key and has exclusive access.
            ...destroy the resource here...
            // Erase key using accessor.
            Map.erase(acc);
        }
    }

5.4.3.1 size_type count( const Key& key ) const
CAUTION: This method may invalidate previously obtained iterators. In serial code, you can instead use equal_range, which does not have this problem.
Returns 1 if map contains key; 0 otherwise.
5.4.3.2 bool find( const_accessor& result, const Key& key ) const
Effects Searches table for pair with given key. If key is found, sets result to provide read-only access to the matching pair.
CAUTION: This method may invalidate previously obtained iterators. In serial code, you can instead use equal_range, which does not have this problem.
Returns True if key was found; false if key was not found.
5.4.3.3 bool find( accessor& result, const Key& key )
Effects Searches table for pair with given key. If key is found, sets result to provide write access to the matching pair.
CAUTION: This method may invalidate previously obtained iterators. In serial code, you can instead use equal_range, which does not have this problem.
Returns True if key was found; false if key was not found.
5.4.3.4 bool insert( const_accessor& result, const Key& key )
Effects Searches table for pair with given key. If not present, inserts new pair(key,T()) into the table. Sets result to provide read-only access to the matching pair.
Returns True if new pair was inserted; false if key was already in the map.
5.4.3.5 bool insert( accessor& result, const Key& key )
Effects Searches table for pair with given key.
If not present, inserts new pair(key,T()) into the table. Sets result to provide write access to the matching pair.

Returns
True if new pair was inserted; false if key was already in the map.

5.4.3.6 bool insert( const_accessor& result, const value_type& value )

Effects
Searches table for pair with given key. If not present, inserts new pair copy-constructed from value into the table. Sets result to provide read-only access to the matching pair.

Returns
True if new pair was inserted; false if key was already in the map.

5.4.3.7 bool insert( accessor& result, const value_type& value )

Effects
Searches table for pair with given key. If not present, inserts new pair copy-constructed from value into the table. Sets result to provide write access to the matching pair.

Returns
True if new pair was inserted; false if key was already in the map.

5.4.3.8 bool insert( const value_type& value )

Effects
Searches table for pair with given key. If not present, inserts new pair copy-constructed from value into the table.

Returns
True if new pair was inserted; false if key was already in the map.

TIP: If you do not need to access the data after insertion, use the form of insert that does not take an accessor; it may work faster and use fewer locks.

5.4.3.9 template<typename InputIterator> void insert( InputIterator first, InputIterator last )

Effects
For each pair p in the half-open interval [first,last), does insert(p). The order of the insertions, or whether they are done concurrently, is unspecified.

CAUTION: The current implementation processes the insertions in order. Future implementations may do the insertions concurrently. If duplicate keys exist in [first,last), be careful not to depend on their insertion order.

5.4.3.10 bool erase( const Key& key )

Effects
Searches table for pair with given key. Removes the matching pair if it exists.
If there is an accessor pointing to the pair, the pair is nonetheless removed from the table but its destruction is deferred until all accessors stop pointing to it.

Returns
True if pair was removed by the call; false if key was not found in the map.

5.4.3.11 bool erase( const_accessor& item_accessor )

Requirements
item_accessor.empty()==false

Effects
Removes pair referenced by item_accessor. Concurrent insertion of the same key creates a new pair in the table.

Returns
True if pair was removed by this thread; false if pair was removed by another thread.

5.4.3.12 bool erase( accessor& item_accessor )

Requirements
item_accessor.empty()==false

Effects
Removes pair referenced by item_accessor. Concurrent insertion of the same key creates a new pair in the table.

Returns
True if pair was removed by this thread; false if pair was removed by another thread.

5.4.4 Parallel Iteration

Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.

NOTE: Do not call concurrent operations, including count and find, while iterating the table. Use concurrent_unordered_map if concurrent traversal and insertion are required.

5.4.4.1 const_range_type range( size_t grainsize=1 ) const

Effects
Constructs a const_range_type representing all keys in the table. The parameter grainsize is in units of hash table buckets. Each bucket typically has on average about one key-value pair.

Returns
const_range_type object for the table.

5.4.4.2 range_type range( size_t grainsize=1 )

Returns
range_type object for the table.

5.4.5 Capacity

5.4.5.1 size_type size() const

Returns
Number of key-value pairs in the table.

NOTE: This method takes constant time, but is slower than for most STL containers.

5.4.5.2 bool empty() const

Returns
size()==0.
NOTE: This method takes constant time, but is slower than for most STL containers.

5.4.5.3 size_type max_size() const

Returns
Inclusive upper bound on number of key-value pairs that the table can hold.

5.4.5.4 size_type bucket_count() const

Returns
Current number of internal buckets. See method rehash for discussion of buckets.

5.4.6 Iterators

Template class concurrent_hash_map supports forward iterators; that is, iterators that can advance only forwards across a table. Reverse iterators are not supported. Concurrent operations (count, find, insert, and erase) invalidate any existing iterators that point into the table. An exception to this rule is that count and find do not invalidate iterators if no insertions or erasures have occurred after the most recent call to method rehash.

NOTE: Do not call concurrent operations, including count and find, while iterating the table. Use concurrent_unordered_map if concurrent traversal and insertion are required.

5.4.6.1 iterator begin()

Returns
iterator pointing to beginning of key-value sequence.

5.4.6.2 iterator end()

Returns
iterator pointing to end of key-value sequence.

5.4.6.3 const_iterator begin() const

Returns
const_iterator pointing to beginning of key-value sequence.

5.4.6.4 const_iterator end() const

Returns
const_iterator pointing to end of key-value sequence.

5.4.6.5 std::pair<iterator, iterator> equal_range( const Key& key );

Returns
Pair of iterators (i,j) such that the half-open range [i,j) contains all pairs in the map (and only such pairs) with keys equal to key. Because the map has no duplicate keys, the half-open range is either empty or contains a single pair.

TIP: This method is a serial alternative to the concurrent count and find methods.

5.4.6.6 std::pair<const_iterator, const_iterator> equal_range( const Key& key ) const;

Description
See 5.4.6.5.

5.4.7 Global Functions

These functions in namespace tbb improve the STL compatibility of concurrent_hash_map.
5.4.7.1 template<typename Key, typename T, typename HashCompare, typename A1, typename A2> bool operator==( const concurrent_hash_map<Key,T,HashCompare,A1>& a, const concurrent_hash_map<Key,T,HashCompare,A2>& b);

Returns
True if a and b contain equal sets of keys and for each pair (k,v1)∈a and pair (k,v2)∈b, the expression bool(v1==v2) is true.

5.4.7.2 template<typename Key, typename T, typename HashCompare, typename A1, typename A2> bool operator!=( const concurrent_hash_map<Key,T,HashCompare,A1>& a, const concurrent_hash_map<Key,T,HashCompare,A2>& b);

Returns
!(a==b)

5.4.7.3 template<typename Key, typename T, typename HashCompare, typename A> void swap( concurrent_hash_map<Key,T,HashCompare,A>& a, concurrent_hash_map<Key,T,HashCompare,A>& b )

Effects
a.swap(b)

5.4.8 tbb_hash_compare Class

Summary
Default HashCompare for concurrent_hash_map.

Syntax
template<typename Key> struct tbb_hash_compare;

Header
#include "tbb/concurrent_hash_map.h"

Description
A tbb_hash_compare<Key> is the default for the HashCompare argument of template class concurrent_hash_map. The built-in definition relies on operator== and tbb_hasher as shown in the Members description. For your own types, you can define a template specialization of tbb_hash_compare or define an overload of tbb_hasher.

There are built-in definitions of tbb_hasher for the following Key types:
• Types that are convertible to a size_t by static_cast
• Pointer types
• std::basic_string
• std::pair<K1,K2> where K1 and K2 are hashed using tbb_hasher.

Members
    namespace tbb {
        template<typename Key>
        struct tbb_hash_compare {
            static size_t hash(const Key& a) {
                return tbb_hasher(a);
            }
            static bool equal(const Key& a, const Key& b) {
                return a==b;
            }
        };

        template<typename T>
        size_t tbb_hasher(const T&);

        template<typename T>
        size_t tbb_hasher(T*);

        template<typename T, typename Traits, typename Alloc>
        size_t tbb_hasher(const std::basic_string<T,Traits,Alloc>&);

        template<typename T1, typename T2>
        size_t tbb_hasher(const std::pair<T1,T2>& );
    };

5.5 concurrent_queue Template Class

Summary
Template class for queue with concurrent operations.

Syntax
template<typename T, typename Alloc=cache_aligned_allocator<T> > class concurrent_queue;

Header
#include "tbb/concurrent_queue.h"

Description
A concurrent_queue is a first-in first-out data structure that permits multiple threads to concurrently push and pop items. Its capacity is unbounded (see footnote 7), subject to memory limitations on the target machine.
The interface is similar to STL std::queue except where it must differ to make concurrent modification safe.

Table 21: Differences Between STL queue and Intel® Threading Building Blocks concurrent_queue

Feature: Access to front and back
    STL std::queue: Methods front and back
    concurrent_queue: Not present. They would be unsafe while concurrent operations are in progress.

Feature: size_type
    STL std::queue: unsigned integral type
    concurrent_queue: signed integral type

Feature: unsafe_size()
    STL std::queue: Returns number of items in queue
    concurrent_queue: Returns number of items in queue. May return incorrect value if any push or try_pop operations are concurrently in flight.

Feature: Copy and pop item unless queue q is empty.
    STL std::queue: bool b=!q.empty(); if(b) { x=q.front(); q.pop(); }
    concurrent_queue: bool b = q.try_pop(x)

Footnote 7: In Intel® TBB 2.1, a concurrent_queue could be bounded. Intel® TBB 2.2 moves this functionality to concurrent_bounded_queue. Compile with TBB_DEPRECATED=1 to restore the old functionality, or (recommended) use concurrent_bounded_queue instead.

Members
    namespace tbb {
        template<typename T, typename Alloc=cache_aligned_allocator<T> >
        class concurrent_queue {
        public:
            // types
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef std::ptrdiff_t size_type;
            typedef std::ptrdiff_t difference_type;
            typedef Alloc allocator_type;

            explicit concurrent_queue(const Alloc& a = Alloc ());
            concurrent_queue(const concurrent_queue& src, const Alloc& a = Alloc());
            template<typename InputIterator>
            concurrent_queue(InputIterator first, InputIterator last, const Alloc& a = Alloc());
            ~concurrent_queue();

            void push( const T& source );
            bool try_pop( T& destination );   // see footnote 8
            void clear() ;

            size_type unsafe_size() const;
            bool empty() const;
            Alloc get_allocator() const;

Footnote 8: Called pop_if_present in Intel® TBB 2.1. Compile with TBB_DEPRECATED=1 to use the old name.
            typedef implementation-defined iterator;
            typedef implementation-defined const_iterator;

            // iterators (these are slow and intended only for debugging)
            iterator unsafe_begin();
            iterator unsafe_end();
            const_iterator unsafe_begin() const;
            const_iterator unsafe_end() const;
        };
    }

5.5.1 concurrent_queue( const Alloc& a = Alloc () )

Effects
Constructs empty queue.

5.5.2 concurrent_queue( const concurrent_queue& src, const Alloc& a = Alloc() )

Effects
Constructs a copy of src.

5.5.3 template<typename InputIterator> concurrent_queue( InputIterator first, InputIterator last, const Alloc& a = Alloc() )

Effects
Constructs a queue containing copies of elements in the iterator half-open interval [first,last).

5.5.4 ~concurrent_queue()

Effects
Destroys all items in the queue.

5.5.5 void push( const T& source )

Effects
Pushes a copy of source onto back of the queue.

5.5.6 bool try_pop ( T& destination )

Effects
If value is available, pops it from the queue, assigns it to destination, and destroys the original value. Otherwise does nothing.

Returns
True if value was popped; false otherwise.

5.5.7 void clear()

Effects
Clears the queue. Afterwards size()==0.

5.5.8 size_type unsafe_size() const

Returns
Number of items in the queue. If there are concurrent modifications in flight, the value might not reflect the actual number of items in the queue.

5.5.9 bool empty() const

Returns
true if queue has no items; false otherwise.

5.5.10 Alloc get_allocator() const

Returns
Copy of allocator used to construct the queue.

5.5.11 Iterators

A concurrent_queue provides limited iterator support that is intended solely to allow programmers to inspect a queue during debugging. It provides iterator and const_iterator types. Both follow the usual STL conventions for forward iterators. The iteration order is from least recently pushed to most recently pushed. Modifying a concurrent_queue invalidates any iterators that reference it.
CAUTION: The iterators are relatively slow. They should be used only for debugging.

Example
The following program builds a queue with the integers 0..9, and then dumps the queue to standard output. Its overall effect is to print 0 1 2 3 4 5 6 7 8 9.

    #include "tbb/concurrent_queue.h"
    #include <iostream>

    using namespace std;
    using namespace tbb;

    int main() {
        concurrent_queue<int> queue;
        for( int i=0; i<10; ++i )
            queue.push(i);
        typedef concurrent_queue<int>::iterator iter;
        for( iter i(queue.unsafe_begin()); i!=queue.unsafe_end(); ++i )
            cout << *i << " ";
        cout << endl;
        return 0;
    }

5.5.11.1 iterator unsafe_begin()

Returns
iterator pointing to beginning of the queue.

5.5.11.2 iterator unsafe_end()

Returns
iterator pointing to end of the queue.

5.5.11.3 const_iterator unsafe_begin() const

Returns
const_iterator pointing to beginning of the queue.

5.5.11.4 const_iterator unsafe_end() const

Returns
const_iterator pointing to end of the queue.

5.6 concurrent_bounded_queue Template Class

Summary
Template class for bounded dual queue with concurrent operations.

Syntax
template<typename T, class Alloc=cache_aligned_allocator<T> > class concurrent_bounded_queue;

Header
#include "tbb/concurrent_queue.h"

Description
A concurrent_bounded_queue is similar to a concurrent_queue, but with the following differences:
• Adds the ability to specify a capacity. The default capacity makes the queue practically unbounded.
• Changes the push operation so that it waits until it can complete without exceeding the capacity.
• Adds a waiting pop operation that waits until it can pop an item.
• Changes the size_type to a signed type.
• Changes the size() operation to return the number of push operations minus the number of pop operations. For example, if there are 3 pop operations waiting on an empty queue, size() returns -3.

Members
To aid comparison, the parts that differ from concurrent_queue are in bold and annotated.
    namespace tbb {
        template<typename T, class Alloc=cache_aligned_allocator<T> >
        class concurrent_bounded_queue {
        public:
            // types
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef Alloc allocator_type;
            // size_type is signed type
            typedef std::ptrdiff_t size_type;
            typedef std::ptrdiff_t difference_type;

            explicit concurrent_bounded_queue(const allocator_type& a = allocator_type());
            concurrent_bounded_queue( const concurrent_bounded_queue& src, const allocator_type& a = allocator_type());
            template<typename InputIterator>
            concurrent_bounded_queue( InputIterator begin, InputIterator end, const allocator_type& a = allocator_type());
            ~concurrent_bounded_queue();

            // waits until it can push without exceeding capacity.
            void push( const T& source );
            // waits if *this is empty
            void pop( T& destination );

            // skips push if it would exceed capacity.
            bool try_push( const T& source );   // see footnote 9
            bool try_pop( T& destination );     // see footnote 10
            void clear() ;

            // safe to call during concurrent modification, can return negative size.
            size_type size() const;
            bool empty() const;
            size_type capacity() const;
            void set_capacity( size_type capacity );
            allocator_type get_allocator() const;

            typedef implementation-defined iterator;
            typedef implementation-defined const_iterator;

            // iterators (these are slow and intended only for debugging)
            iterator unsafe_begin();
            iterator unsafe_end();
            const_iterator unsafe_begin() const;
            const_iterator unsafe_end() const;
        };
    }

Footnote 9: Method try_push was called push_if_not_full in Intel® TBB 2.1.
Footnote 10: Method try_pop was called pop_if_present in Intel® TBB 2.1.

Because concurrent_bounded_queue is similar to concurrent_queue, the following subsections describe only methods that differ.

5.6.1 void push( const T& source )

Effects
Waits until size()<capacity() [...]

[...] typename Alloc=cache_aligned_allocator<T> > class concurrent_priority_queue;

Header
#include "tbb/concurrent_priority_queue.h"

Description
A concurrent_priority_queue is a container that permits multiple threads to concurrently push and pop items.
Items are popped in priority order as determined by a template parameter. The queue's capacity is unbounded, subject to memory limitations on the target machine.

The interface is similar to STL std::priority_queue except where it must differ to make concurrent modification safe.

Table 43: Differences between STL priority_queue and Intel® Threading Building Blocks concurrent_priority_queue

Feature: Choice of underlying container
    STL std::priority_queue: Sequence template parameter
    concurrent_priority_queue: No choice of underlying container; allocator choice is provided instead

Feature: Access to highest priority item
    STL std::priority_queue: const value_type& top() const
    concurrent_priority_queue: Not available. Unsafe for concurrent container

Feature: Copy and pop item if present
    STL std::priority_queue: bool b=!q.empty(); if(b) { x=q.top(); q.pop(); }
    concurrent_priority_queue: bool b = q.try_pop(x);

Feature: Get number of items in queue
    STL std::priority_queue: size_type size() const
    concurrent_priority_queue: Same, but may be inaccurate due to pending concurrent push or pop operations

Feature: Check if there are items in queue
    STL std::priority_queue: bool empty() const
    concurrent_priority_queue: Same, but may be inaccurate due to pending concurrent push or pop operations

Members
    namespace tbb {
        template <typename T, typename Compare=std::less<T>, typename A=cache_aligned_allocator<T> >
        class concurrent_priority_queue {
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef size_t size_type;
            typedef ptrdiff_t difference_type;
            typedef A allocator_type;

            concurrent_priority_queue(const allocator_type& a = allocator_type());
            concurrent_priority_queue(size_type init_capacity, const allocator_type& a = allocator_type());
            template<typename InputIterator>
            concurrent_priority_queue(InputIterator begin, InputIterator end, const allocator_type& a = allocator_type());
            concurrent_priority_queue(const concurrent_priority_queue& src, const allocator_type& a = allocator_type());
            concurrent_priority_queue& operator=(const concurrent_priority_queue& src);
            ~concurrent_priority_queue();

            bool empty() const;
            size_type size() const;

            void push(const_reference elem);
            bool try_pop(reference elem);

            void clear();
            void swap(concurrent_priority_queue&
            other);
            allocator_type get_allocator() const;
        };
    }

5.7.1 concurrent_priority_queue(const allocator_type& a = allocator_type())

Effects
Constructs empty queue.

5.7.2 concurrent_priority_queue(size_type init_capacity, const allocator_type& a = allocator_type())

Effects
Constructs an empty queue with an initial capacity.

5.7.3 concurrent_priority_queue(InputIterator begin, InputIterator end, const allocator_type& a = allocator_type())

Effects
Constructs a queue containing copies of elements in the iterator half-open interval [begin, end).

5.7.4 concurrent_priority_queue(const concurrent_priority_queue& src, const allocator_type& a = allocator_type())

Effects
Constructs a copy of src. This operation is not thread-safe and may result in an error or an invalid copy of src if another thread is concurrently modifying src.

5.7.5 concurrent_priority_queue& operator=(const concurrent_priority_queue& src)

Effects
Assigns contents of src to *this. This operation is not thread-safe and may result in an error or an invalid copy of src if another thread is concurrently modifying src.

5.7.6 ~concurrent_priority_queue()

Effects
Destroys all items in the queue, and the container itself, so that it can no longer be used.

5.7.7 bool empty() const

Returns
true if queue has no items; false otherwise. May be inaccurate when concurrent push or try_pop operations are pending. This operation reads shared data and may trigger a race condition in race detection tools when used concurrently.

5.7.8 size_type size() const

Returns
Number of items in the queue. May be inaccurate when concurrent push or try_pop operations are pending. This operation reads shared data and may trigger a race condition in race detection tools when used concurrently.

5.7.9 void push(const_reference elem)

Effects
Pushes a copy of elem into the queue. This operation is thread-safe with other push and try_pop operations.
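The "Copy and pop item if present" row of Table 43 is the main interface difference from std::priority_queue: separate top() and pop() calls would leave a window in which another thread could remove the item. The following is a minimal model of that combined test-and-pop contract, written with the C++ standard library only. The class name LockedPriorityQueue and the helper drain are hypothetical, and the single-mutex design is an assumption chosen for brevity; it is not how Intel® TBB implements concurrent_priority_queue.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical model: a std::priority_queue guarded by one mutex.
// It illustrates the try_pop contract only; it is NOT the TBB implementation.
template <typename T, typename Compare = std::less<T> >
class LockedPriorityQueue {
    std::priority_queue<T, std::vector<T>, Compare> items_;
    std::mutex mutex_;
public:
    void push(const T& elem) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(elem);
    }

    // Emptiness test, copy, and removal happen under one lock, so no other
    // thread can take the item between the test and the pop.
    bool try_pop(T& elem) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        elem = items_.top();
        items_.pop();
        return true;
    }
};

// Pops every item and returns them in the order they came out.
inline std::vector<int> drain(LockedPriorityQueue<int>& q) {
    std::vector<int> out;
    int x = 0;
    while (q.try_pop(x)) out.push_back(x);
    // The queue is now empty, so one more try_pop must report failure.
    assert(!q.try_pop(x));
    return out;
}
```

With the default std::less comparison, pushing 3, 9, and 1 and then calling drain yields the items from highest to lowest priority.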
5.7.10 bool try_pop(reference elem)

Effects
If the queue is not empty, copies the highest priority item from the queue and assigns it to elem, and destroys the popped item in the queue; otherwise, does nothing. This operation is thread-safe with other push and try_pop operations.

Returns
true if an item was popped; false otherwise.

5.7.11 void clear()

Effects
Clears the queue; results in size()==0. This operation is not thread-safe.

5.7.12 void swap(concurrent_priority_queue& other)

Effects
Swaps the queue contents with those of other. This operation is not thread-safe.

5.7.13 allocator_type get_allocator() const

Returns
Copy of allocator used to construct the queue.

5.8 concurrent_vector

Summary
Template class for vector that can be concurrently grown and accessed.

Syntax
template<typename T, class Alloc=cache_aligned_allocator<T> > class concurrent_vector;

Header
#include "tbb/concurrent_vector.h"

Description
A concurrent_vector is a container with the following features:
• Random access by index. The index of the first element is zero.
• Multiple threads can grow the container and append new elements concurrently.
• Growing the container does not invalidate existing iterators or indices.

A concurrent_vector meets all requirements for a Container and a Reversible Container as specified in the ISO C++ standard. It does not meet the Sequence requirements due to absence of methods insert() and erase().
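One way to picture why growth can leave existing elements where they are is storage built from separately allocated arrays. The sketch below is a serial illustration using only the C++ standard library. The class name ChunkedVector, the fixed chunk size, and the helper first_element_stays_put are assumptions made here for illustration; the sketch does not reflect the actual array sizes or any of the concurrency machinery of concurrent_vector.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical serial model: elements live in fixed-size chunks that are
// never reallocated, so a reference to an element survives later growth.
template <typename T, std::size_t ChunkSize = 4>
class ChunkedVector {
    std::vector<std::unique_ptr<T[]> > chunks_;  // table of chunk pointers
    std::size_t size_ = 0;
public:
    // Appends a copy of item and returns a reference to the stored copy.
    T& push_back(const T& item) {
        if (size_ % ChunkSize == 0)  // current chunks are full: add one more
            chunks_.push_back(std::unique_ptr<T[]>(new T[ChunkSize]()));
        T& slot = chunks_[size_ / ChunkSize][size_ % ChunkSize];
        slot = item;
        ++size_;
        return slot;
    }
    T& operator[](std::size_t index) {
        return chunks_[index / ChunkSize][index % ChunkSize];
    }
    std::size_t size() const { return size_; }
};

// Grows the container well past one chunk and checks that the address of
// the first element did not change.
inline bool first_element_stays_put() {
    ChunkedVector<int> v;
    int* first = &v.push_back(7);
    for (int i = 0; i < 100; ++i) v.push_back(i);
    assert(v.size() == 101);
    return first == &v[0] && *first == 7;
}
```

Only the table of chunk pointers is reallocated as the container grows; the chunks themselves stay put, which is the property the third bullet above relies on.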
Members
    namespace tbb {
        template<typename T, class Alloc=cache_aligned_allocator<T> >
        class concurrent_vector {
        public:
            typedef size_t size_type;
            typedef allocator-A-rebound-for-T allocator_type;   // see footnote 11
            typedef T value_type;
            typedef ptrdiff_t difference_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef T* pointer;
            typedef const T *const_pointer;
            typedef implementation-defined iterator;
            typedef implementation-defined const_iterator;
            typedef implementation-defined reverse_iterator;
            typedef implementation-defined const_reverse_iterator;

            // Parallel ranges
            typedef implementation-defined range_type;
            typedef implementation-defined const_range_type;
            range_type range( size_t grainsize );
            const_range_type range( size_t grainsize ) const;

            // Constructors
            explicit concurrent_vector( const allocator_type& a = allocator_type() );
            concurrent_vector( const concurrent_vector& x );
            template<typename M>
            concurrent_vector( const concurrent_vector<T, M>& x );
            explicit concurrent_vector( size_type n, const T& t=T(), const allocator_type& a = allocator_type() );
            template<typename InputIterator>
            concurrent_vector(InputIterator first, InputIterator last, const allocator_type& a=allocator_type());

Footnote 11: This rebinding follows practice established by both the Microsoft and GNU implementations of std::vector.
            // Assignment
            concurrent_vector& operator=( const concurrent_vector& x );
            template<typename M>
            concurrent_vector& operator=( const concurrent_vector<T, M>& x );
            void assign( size_type n, const T& t );
            template<typename InputIterator>
            void assign( InputIterator first, InputIterator last );

            // Concurrent growth operations (see footnote 12)
            iterator grow_by( size_type delta );
            iterator grow_by( size_type delta, const T& t );
            iterator grow_to_at_least( size_type n );
            iterator push_back( const T& item );

            // Items access
            reference operator[]( size_type index );
            const_reference operator[]( size_type index ) const;
            reference at( size_type index );
            const_reference at( size_type index ) const;
            reference front();
            const_reference front() const;
            reference back();
            const_reference back() const;

            // Storage
            bool empty() const;
            size_type capacity() const;
            size_type max_size() const;
            size_type size() const;
            allocator_type get_allocator() const;

            // Non-concurrent operations on whole container
            void reserve( size_type n );
            void shrink_to_fit();
            void swap( concurrent_vector& vector );

Footnote 12: The return types of the growth methods are different in Intel® TBB 2.2 than in prior versions. See footnotes in the descriptions of the individual methods for details.
            void clear();
            ~concurrent_vector();

            // Iterators
            iterator begin();
            iterator end();
            const_iterator begin() const;
            const_iterator end() const;
            reverse_iterator rbegin();
            reverse_iterator rend();
            const_reverse_iterator rbegin() const;
            const_reverse_iterator rend() const;

            // C++0x extensions
            const_iterator cbegin() const;
            const_iterator cend() const;
            const_reverse_iterator crbegin() const;
            const_reverse_iterator crend() const;
        };

        // Template functions
        template<typename T, class A1, class A2>
        bool operator==( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );

        template<typename T, class A1, class A2>
        bool operator!=( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );

        template<typename T, class A1, class A2>
        bool operator<( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );

        template<typename T, class A1, class A2>
        bool operator>( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );

        template<typename T, class A1, class A2>
        bool operator<=( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );

        template<typename T, class A1, class A2>
        bool operator>=( const concurrent_vector<T, A1>& a, const concurrent_vector<T, A2>& b );

        template<typename T, class A>
        void swap( concurrent_vector<T, A>& a, concurrent_vector<T, A>& b );
    }

Exception Safety

Concurrent growing is fundamentally incompatible with ideal exception safety (see footnote 13). Nonetheless, concurrent_vector offers a practical level of exception safety.

Element type T must meet the following requirements:
• Its destructor must not throw an exception.
• If its default constructor can throw an exception, its destructor must be non-virtual and work correctly on zero-filled memory.
Otherwise the program's behavior is undefined.

Growth (5.8.3) and vector assignment (5.8.1) append a sequence of elements to a vector. If an exception occurs, the impact on the vector depends upon the cause of the exception:
• If the exception is thrown by the constructor of an element, then all subsequent elements in the appended sequence will be zero-filled.
• Otherwise, the exception was thrown by the vector's allocator. The vector becomes broken.
  Each element in the appended sequence will be in one of three states:
    o constructed
    o zero-filled
    o unallocated in memory

Once a vector becomes broken, care must be taken when accessing it:
• Accessing an unallocated element with method at causes an exception std::range_error. Any other way of accessing an unallocated element has undefined behavior.
• The values of capacity() and size() may be less than expected.
• Access to a broken vector via back() has undefined behavior.

However, the following guarantees hold for broken or unbroken vectors:
• Let k be an index of an unallocated element. Then size()≤capacity()≤k.
• Growth operations never cause size() or capacity() to decrease.

If a concurrent growth operation successfully completes, the appended sequence remains valid and accessible even if a subsequent growth operation fails.

Footnote 13: For example, consider P threads each appending N elements. To be perfectly exception safe, these operations would have to be serialized, because each operation has to know that the previous operation succeeded before allocating more indices.

Fragmentation

Unlike a std::vector, a concurrent_vector never moves existing elements when it grows. The container allocates a series of contiguous arrays. The first reservation, growth, or assignment operation determines the size of the first array. Using a small number of elements as initial size incurs fragmentation across cache lines that may increase element access time. The method shrink_to_fit() merges several smaller arrays into a single contiguous array, which may improve access time.

5.8.1 Construction, Copy, and Assignment

Safety
These operations must not be invoked concurrently on the same vector.

5.8.1.1 concurrent_vector( const allocator_type& a = allocator_type() )

Effects
Constructs empty vector using optionally specified allocator instance.
5.8.1.2 concurrent_vector( size_type n, const_reference t=T(), const allocator_type& a = allocator_type() );

Effects
Constructs vector of n copies of t, using optionally specified allocator instance. If t is not specified, each element is default constructed instead of copied.

5.8.1.3 template<typename InputIterator> concurrent_vector( InputIterator first, InputIterator last, const allocator_type& a = allocator_type() )

Effects
Constructs vector that is copy of the sequence [first,last), making only N calls to the copy constructor of T, where N is the distance between first and last.

5.8.1.4 concurrent_vector( const concurrent_vector& src )

Effects
Constructs copy of src.

5.8.1.5 concurrent_vector& operator=( const concurrent_vector& src )

Effects
Assigns contents of src to *this.

Returns
Reference to left hand side.

5.8.1.6 template<typename M> concurrent_vector& operator=( const concurrent_vector<T, M>& src )

Effects
Assigns contents of src to *this.

Returns
Reference to left hand side.

5.8.1.7 void assign( size_type n, const_reference t )

Effects
Assigns n copies of t.

5.8.1.8 template<typename InputIterator> void assign( InputIterator first, InputIterator last )

Effects
Assigns copies of sequence [first,last), making only N calls to the copy constructor of T, where N is the distance between first and last.

5.8.2 Whole Vector Operations

Safety
Concurrent invocation of these operations on the same instance is not safe.

5.8.2.1 void reserve( size_type n )

Effects
Reserves space for at least n elements.

Throws
std::length_error if n>max_size(). It can also throw an exception if the allocator throws an exception.

Safety
If an exception is thrown, the instance remains in a valid state.

5.8.2.2 void shrink_to_fit() (see footnote 14)

Effects
Compacts the internal representation to reduce fragmentation.

5.8.2.3 void swap( concurrent_vector& x )

Effects
Swaps contents of two vectors. Takes O(1) time.

5.8.2.4 void clear()

Effects
Erases all elements. Afterwards, size()==0.
Does not free internal arrays (see footnote 15).

TIP: To free internal arrays, call shrink_to_fit() after clear().

5.8.2.5 ~concurrent_vector()

Effects
Erases all elements and destroys the vector.

Footnote 14: Method shrink_to_fit was called compact() in Intel® TBB 2.1. It was renamed to match the C++0x std::vector::shrink_to_fit().
Footnote 15: The original release of Intel® TBB 2.1 and its "update 1" freed the arrays. The change in "update 2" reverts back to the behavior of Intel® TBB 2.0. The motivation for not freeing the arrays is to behave similarly to std::vector::clear().

5.8.3 Concurrent Growth

Safety
The methods described in this section may be invoked concurrently on the same vector.

5.8.3.1 iterator grow_by( size_type delta, const_reference t=T() ) (see footnote 16)

Effects
Appends a sequence comprising delta copies of t to the end of the vector. If t is not specified, the new elements are default constructed.

Returns
Iterator pointing to beginning of appended sequence.

5.8.3.2 iterator grow_to_at_least( size_type n ) (see footnote 17)

Effects
Appends minimal sequence of elements such that vector.size()>=n. The new elements are default constructed. Blocks until all elements in range [0..n) are allocated (but not necessarily constructed if they are under construction by a different thread).

TIP: If a thread must know whether construction of an element has completed, consider the following technique. Instantiate the concurrent_vector using a zero_allocator (8.5). Define the constructor T() such that when it completes, it sets a field of T to non-zero. A thread can check whether an item in the concurrent_vector is constructed by checking whether the field is non-zero.

Returns
Iterator that points to beginning of appended sequence, or pointer to (*this)[n] if no elements were appended.

Footnote 16: Return type was size_type in Intel® TBB 2.1.
Footnote 17: Return type was void in Intel® TBB 2.1.
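The TIP in 5.8.3.2 about detecting whether an element has been constructed can be illustrated without Intel® TBB. In the single-threaded sketch below, std::calloc stands in for zero_allocator as the source of zero-filled memory, and the type Item, its fields, and the helper names are hypothetical. A real concurrent checker would additionally need the flag to be written and read with appropriate atomic operations; that detail is an addition here and is not covered by the TIP.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Hypothetical element type: the constructor sets `constructed` to a
// non-zero value as its last step, as the TIP suggests.
struct Item {
    int payload;
    int constructed;  // stays 0 in zero-filled memory until T() completes
    Item() {
        payload = 42;
        constructed = 1;  // last action of the constructor
    }
};

// An item counts as constructed once the flag is non-zero.
inline bool is_constructed(const Item& item) {
    return item.constructed != 0;
}

// Assumption: calloc plays the role of zero_allocator (zero-filled storage).
inline bool demo_zero_fill_flag() {
    void* raw = std::calloc(1, sizeof(Item));
    if (!raw) return false;
    Item* slot = static_cast<Item*>(raw);

    bool before = is_constructed(*slot);  // allocated but not yet constructed
    new (slot) Item();                    // run the default constructor in place
    bool after = is_constructed(*slot);
    int payload = slot->payload;

    slot->~Item();
    std::free(raw);
    assert(!before && after);
    return !before && after && payload == 42;
}
```

The flag distinguishes the "allocated but not yet constructed" state that grow_to_at_least may expose from a fully constructed element.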
5.8.3.3 iterator push_back( const_reference value ) (see footnote 18)

Effects
Appends copy of value to the end of the vector.

Returns
Iterator that points to the copy.

Footnote 18: Return type was size_type in Intel® TBB 2.1.

5.8.4 Access

Safety
The methods described in this section may be concurrently invoked on the same vector as methods for concurrent growth (5.8.3). However, the returned reference may be to an element that is being concurrently constructed.

5.8.4.1 reference operator[]( size_type index )

Returns
Reference to element with the specified index.

5.8.4.2 const_reference operator[]( size_type index ) const

Returns
Const reference to element with the specified index.

5.8.4.3 reference at( size_type index )

Returns
Reference to element at specified index.

Throws
std::out_of_range if index >= size().

5.8.4.4 const_reference at( size_type index ) const

Returns
Const reference to element at specified index.

Throws
std::out_of_range if index >= size() or index is for broken portion of vector.

5.8.4.5 reference front()

Returns
(*this)[0]

5.8.4.6 const_reference front() const

Returns
(*this)[0]

5.8.4.7 reference back()

Returns
(*this)[size()-1]

5.8.4.8 const_reference back() const

Returns
(*this)[size()-1]

5.8.5 Parallel Iteration

Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.

5.8.5.1 range_type range( size_t grainsize=1 )

Returns
Range over entire concurrent_vector that permits read-write access.

5.8.5.2 const_range_type range( size_t grainsize=1 ) const

Returns
Range over entire concurrent_vector that permits read-only access.

5.8.6 Capacity

5.8.6.1 size_type size() const

Returns
Number of elements in the vector.
The result may include elements that are allocated but still under construction by concurrent calls to any of the growth methods (5.8.3).

5.8.6.2 bool empty() const

Returns
size()==0

5.8.6.3 size_type capacity() const

Returns
Maximum size to which vector can grow without having to allocate more memory.

NOTE: Unlike an STL vector, a concurrent_vector does not move existing elements if it allocates more memory.

5.8.6.4 size_type max_size() const

Returns
Highest size that the vector could possibly reach.

5.8.7 Iterators

Template class concurrent_vector supports random access iterators as defined in Section 24.1.4 of the ISO C++ Standard. Unlike a std::vector, the iterators are not raw pointers. A concurrent_vector meets the reversible container requirements in Table 66 of the ISO C++ Standard.

5.8.7.1 iterator begin()

Returns
iterator pointing to beginning of the vector.

5.8.7.2 const_iterator begin() const

Returns
const_iterator pointing to beginning of the vector.

5.8.7.3 iterator end()

Returns
iterator pointing to end of the vector.

5.8.7.4 const_iterator end() const

Returns
const_iterator pointing to end of the vector.

5.8.7.5 reverse_iterator rbegin()

Returns
reverse_iterator pointing to beginning of reversed vector.

5.8.7.6 const_reverse_iterator rbegin() const

Returns
const_reverse_iterator pointing to beginning of reversed vector.

5.8.7.7 reverse_iterator rend()

Returns
reverse_iterator pointing to end of reversed vector.

5.8.7.8 const_reverse_iterator rend() const

Returns
const_reverse_iterator pointing to end of reversed vector.

6 Flow Graph

There are some applications that best express dependencies as messages passed between nodes in a flow graph. These messages may contain data or simply act as signals that a predecessor has completed. The graph class and its associated node classes can be used to express such applications. All graph-related classes and functions are in the tbb::flow namespace.
Primary Components

There are 3 types of components used to implement a graph:
• A graph object
• Nodes
• Edges

The graph object is the owner of the tasks created on behalf of the flow graph. Users can wait on the graph if they need to wait for the completion of all of the tasks related to the flow graph execution. One can also register external interactions with the graph and run tasks under the ownership of the flow graph.

Nodes invoke user-provided function objects or manage messages as they flow to and from other nodes. There are pre-defined nodes that buffer, filter, broadcast or order items as they flow through the graph.

Edges are the connections between the nodes, created by calls to the make_edge function.

Message Passing Protocol

In an Intel® TBB flow graph, edges dynamically switch between a push and pull protocol for passing messages. An Intel® TBB flow graph is G = ( V, S, L ), where V is the set of nodes, S is the set of edges that are currently using a push protocol, and L is the set of edges that are currently using a pull protocol. For each edge (Vi, Vj), Vi is the predecessor / sender and Vj is the successor / receiver. When in the push set S, messages over an edge are initiated by the sender, which tries to put to the receiver. When in the pull set L, messages are initiated by the receiver, which tries to get from the sender.

If a message attempt across an edge fails, the edge is moved to the other set. For example, if a put across the edge (Vi, Vj) fails, the edge is removed from the push set S and placed in the pull set L. This dynamic push/pull protocol is the key to performance in a non-preemptive tasking library such as Intel® TBB, where simply repeating failed sends or receives is not an efficient option. Figure 4 summarizes this dynamic protocol.
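The switching rule described above can be written as a small state function. The following is a minimal sketch in plain C++ and is not part of Intel® TBB: the names EdgeMode and next_mode are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Which protocol an edge currently uses: Push corresponds to membership in
// the set S, Pull to membership in the set L.
enum class EdgeMode { Push, Pull };

// Hypothetical helper (not a TBB function): returns the mode of an edge
// after one message attempt. An accepted attempt leaves the edge in its
// current set; a rejected attempt moves it to the other set.
inline EdgeMode next_mode(EdgeMode current, bool attempt_accepted) {
    if (attempt_accepted) return current;
    return current == EdgeMode::Push ? EdgeMode::Pull : EdgeMode::Push;
}
```

For instance, next_mode(EdgeMode::Push, false) yields EdgeMode::Pull, which corresponds to a rejected put moving the edge from S to L.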
[Figure 4: The dynamic push / pull protocol. State diagram with two states, "Use Push Protocol for (Vs, Vr)" and "Use Pull Protocol for (Vs, Vr)". Transitions: put to Vr accepted (stay in push), put to Vr rejected (switch to pull), request from Vs accepted (stay in pull), request from Vs rejected (switch to push).]

Body Objects

Some nodes execute user-provided body objects. These objects can be created by instantiating function objects or lambda expressions. The nodes that use body objects include continue_node, function_node and source_node.

CAUTION: The body objects passed to the flow graph nodes are copied. Therefore updates to member variables will not affect the original object used to construct the node. If the state held within a body object must be inspected from outside of the node, the copy_body function described in 6.22 can be used to obtain an updated copy.

Dependency Flow Graph Example

    #include <cstdio>
    #include <string>
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    // Body shared by all nodes: prints the name of the node.
    struct body {
        std::string my_name;
        body( const char *name ) : my_name(name) {}
        void operator()( continue_msg ) const {
            printf("%s\n", my_name.c_str());
        }
    };

    int main() {
        graph g;

        broadcast_node< continue_msg > start;
        continue_node<continue_msg> a( g, body("A"));
        continue_node<continue_msg> b( g, body("B"));
        continue_node<continue_msg> c( g, body("C"));
        continue_node<continue_msg> d( g, body("D"));
        continue_node<continue_msg> e( g, body("E"));

        make_edge( start, a );
        make_edge( start, b );
        make_edge( a, c );
        make_edge( b, c );
        make_edge( c, d );
        make_edge( a, e );

        // Run the whole graph three times.
        for (int i = 0; i < 3; ++i ) {
            start.try_put( continue_msg() );
            g.wait_for_all();
        }

        return 0;
    }

In this example, five computations A-E are set up with the partial ordering shown in Figure 5. For each edge in the flow graph, the node at the tail of the edge must complete its execution before the node at the head may begin.

NOTE: This is a simple syntactic example only. Since each node in a flow graph may execute as an independent task, the granularity of each node should follow the general guidelines for tasks as described in Section 3.2.3 of the Intel® Threading Building Blocks Tutorial.
[Figure 5: A simple dependency graph.]

In this example, nodes A-E print out their names. All of these nodes are therefore able to use struct body to construct their body objects.

In function main, the flow graph is set up once and then run three times. All of the nodes in this example pass around continue_msg objects. This type is described in Section 6.4 and is used to communicate that a node has completed its execution.

The first line in function main instantiates a graph object, g. On the next line, a broadcast_node named start is created. Anything passed to this node will be broadcast to all of its successors. The node start is used in the for loop at the bottom of main to launch the execution of the rest of the flow graph.

In the example, five continue_node objects are created, named a through e. Each node is constructed with a reference to graph g and the function object to invoke when it runs. The successor / predecessor relationships are set up by the make_edge calls that follow the declaration of the nodes.

After the nodes and edges are set up, the try_put in each iteration of the for loop results in a broadcast of a continue_msg to both a and b. Both a and b are waiting for a single continue_msg, since they both have only a single predecessor, start.

When they receive the message from start, they execute their body objects. When complete, they each forward a continue_msg to their successors, and so on. The graph uses tasks to execute the node bodies as well as to forward messages between the nodes, allowing computation to execute concurrently when possible.

The classes and functions used in this example are described in detail in the remaining sections in Appendix D.
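To make the firing rule concrete, the following sketch models one run of the dependency graph sequentially in plain C++, without Intel® TBB. ModelNode and run_once are hypothetical names used only here. The model processes signals from a single queue, so it produces just one of the orders that the partial ordering allows; the real graph runs node bodies as tasks and may interleave them differently.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Sequential stand-in for a continue_node (an assumption for illustration,
// not the TBB implementation).
struct ModelNode {
    char name;                            // printed name of the node
    int threshold;                        // number of predecessors
    int count;                            // signals received since last firing
    std::vector<std::size_t> successors;  // indices of successor nodes
};

// Runs the example graph once and returns the node names in firing order.
inline std::string run_once() {
    // Indices: 0 = a, 1 = b, 2 = c, 3 = d, 4 = e.
    // Edges: a->c, a->e, b->c, c->d (start->a and start->b are handled below).
    std::vector<ModelNode> nodes = {
        {'A', 1, 0, {2, 4}},
        {'B', 1, 0, {2}},
        {'C', 2, 0, {3}},
        {'D', 1, 0, {}},
        {'E', 1, 0, {}},
    };
    std::string order;
    std::queue<std::size_t> signals;
    signals.push(0);  // start broadcasts a signal to a ...
    signals.push(1);  // ... and to b
    while (!signals.empty()) {
        std::size_t i = signals.front();
        signals.pop();
        ModelNode &n = nodes[i];
        if (++n.count < n.threshold) continue;  // still waiting for predecessors
        n.count = 0;                            // reset so the node can fire again
        order += n.name;                        // "execute" the body
        for (std::size_t s : n.successors) signals.push(s);
    }
    return order;
}
```

In any order the model returns, C appears after both A and B, D appears after C, and E appears after A, which matches the edges created by the make_edge calls in the example.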
Message Flow Graph Example

    #include <cstdio>
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    struct square {
        int operator()(int v) { return v*v; }
    };

    struct cube {
        int operator()(int v) { return v*v*v; }
    };

    // Adds both tuple elements to a running total held by reference.
    class sum {
        int &my_sum;
    public:
        sum( int &s ) : my_sum(s) {}
        int operator()( std::tuple< int, int > v ) {
            my_sum += std::get<0>(v) + std::get<1>(v);
            return my_sum;
        }
    };

    int main() {
        int result = 0;

        graph g;
        broadcast_node<int> input;
        function_node<int,int> squarer( g, unlimited, square() );
        function_node<int,int> cuber( g, unlimited, cube() );
        join_node< std::tuple<int,int>, queueing > join( g );
        function_node< std::tuple<int,int>, int > summer( g, serial, sum(result) );

        make_edge( input, squarer );
        make_edge( input, cuber );
        make_edge( squarer, std::get<0>( join.inputs() ) );
        make_edge( cuber, std::get<1>( join.inputs() ) );
        make_edge( join, summer );

        for (int i = 1; i <= 10; ++i)
            input.try_put(i);
        g.wait_for_all();

        printf("Final result is %d\n", result);
        return 0;
    }

This example calculates the sum of x*x + x*x*x for all x = 1 to 10.

NOTE: This is a simple syntactic example only. Since each node in a flow graph may execute as an independent task, the granularity of each node should follow the general guidelines for tasks as described in Section 3.2.3 of the Intel® Threading Building Blocks Tutorial.

The layout of this example is shown in Figure 6. Each value enters through the broadcast_node<int> input. This node broadcasts the value to both squarer and cuber, which calculate x*x and x*x*x respectively. The output of each of these nodes is put to one of join's ports. A tuple containing both values is created by join_node< tuple<int,int> > join and forwarded to summer, which adds both values to the running total. Both squarer and cuber allow unlimited concurrency, that is, they each may process multiple values simultaneously. The final summer, which updates a shared total, is only allowed to process a single incoming tuple at a time, eliminating the need for a lock around the shared value.
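As a cross-check on the arithmetic, the value that the example accumulates can be computed with an ordinary loop. This is a sequential reference sketch in plain C++; the function name expected_result is hypothetical and is not part of the example or of Intel® TBB.

```cpp
#include <cassert>

// Sequential reference: sum of x*x + x*x*x for x in [first, last].
inline int expected_result(int first, int last) {
    int total = 0;
    for (int x = first; x <= last; ++x)
        total += x * x + x * x * x;
    return total;
}
```

For x = 1 to 10 the squares sum to 385 and the cubes sum to 3025, so expected_result(1, 10) returns 3410.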
The classes square, cube and sum define the three user-defined operations. Each class is used to create a function_node.

In function main, the flow graph is set up and then the values 1 through 10 are put into the node input. All the nodes in this example pass around values of type int. The nodes used in this example are all class templates and therefore can be used with any type that supports copy construction, including pointers and objects.

CAUTION: Values are copied as they pass between nodes and therefore passing around large objects should be avoided. To avoid large copy overheads, pointers to large objects can be passed instead.

[Figure 6: A simple message flow graph.]

The classes and functions used in this example are described in detail in the remaining sections of Appendix D.

6.1 graph Class

Summary
Class that serves as a handle to a flow graph of nodes and edges.

Syntax
    class graph;

Header
    #include "tbb/flow_graph.h"

Description
A graph object contains a root task that is the parent of all tasks created on behalf of the flow graph and its nodes. It provides methods that can be used to access the root task, to wait for the children of the root task to complete, to explicitly increment or decrement the root task's reference count, and to run a task as a child of the root task.

CAUTION: Destruction of flow graph nodes before calling wait_for_all on their associated graph object has undefined behavior and can lead to program failure.

Members
    namespace tbb {
    namespace flow {

    class graph {
    public:
        graph();
        ~graph();

        void increment_wait_count();
        void decrement_wait_count();

        template< typename Receiver, typename Body >
        void run( Receiver &r, Body body );
        template< typename Body >
        void run( Body body );
        void wait_for_all();
        task * root_task();
    };

    }
    }

6.1.1 graph()

Effects
Constructs a graph with no nodes.
Instantiates a root task of class empty_task to serve as a parent for all of the tasks generated during runs of the graph. Sets ref_count of the root task to 1.

6.1.2 ~graph()

Effects
Calls wait_for_all on the graph, then destroys the root task.

6.1.3 void increment_wait_count()

Description
Used to register that an external entity may still interact with the graph.

Effects
Increments the ref_count of the root task.

6.1.4 void decrement_wait_count()

Description
Used to unregister an external entity that may have interacted with the graph.

Effects
Decrements the ref_count of the root task.

6.1.5 template< typename Receiver, typename Body > void run( Receiver &r, Body body )

Description
This method can be used to enqueue a task that runs a body and puts its output to a specific receiver. The task is created as a child of the graph's root task and therefore wait_for_all will not return until this task completes.

Effects
Enqueues a task that invokes r.try_put( body() ). It does not wait for the task to complete. The enqueued task is a child of the root task.

6.1.6 template< typename Body > void run( Body body )

Description
This method enqueues a task that runs as a child of the graph's root task. Calls to wait_for_all will not return until this enqueued task completes.

Effects
Enqueues a task that invokes body(). It does not wait for the task to complete.

6.1.7 void wait_for_all()

Effect
Blocks until all tasks associated with the root task have completed and the number of decrement_wait_count calls equals the number of increment_wait_count calls. Because it calls wait_for_all on the root graph task, the calling thread may participate in work-stealing while it is blocked.

6.1.8 task *root_task()

Returns
Returns a pointer to the root task of the flow graph.

6.2 sender Template Class

Summary
An abstract base class for nodes that act as message senders.
Syntax
    template< typename T > class sender;

Header
    #include "tbb/flow_graph.h"

Description
The sender template class is an abstract base class that defines the interface for nodes that can act as senders. Default implementations for several functions are provided.

Members
    namespace tbb {
    namespace flow {

    template< typename T >
    class sender {
    public:
        typedef T output_type;
        typedef receiver<T> successor_type;
        virtual ~sender();
        virtual bool register_successor( successor_type &r ) = 0;
        virtual bool remove_successor( successor_type &r ) = 0;
        virtual bool try_get( output_type & ) { return false; }
        virtual bool try_reserve( output_type & ) { return false; }
        virtual bool try_release( ) { return false; }
        virtual bool try_consume( ) { return false; }
    };

    }
    }

6.2.1 ~sender()

Description
The destructor.

6.2.2 bool register_successor( successor_type & r ) = 0

Description
A pure virtual method that describes the interface for adding a successor node to the set of successors for the sender.

Returns
True if the successor is added. False otherwise.

6.2.3 bool remove_successor( successor_type & r ) = 0

Description
A pure virtual method that describes the interface for removing a successor node from the set of successors for a sender.

Returns
True if the successor is removed. False otherwise.

6.2.4 bool try_get( output_type & )

Description
Requests an item from a sender.

Returns
The default implementation returns false.

6.2.5 bool try_reserve( output_type & )

Description
Reserves an item at the sender.

Returns
The default implementation returns false.

6.2.6 bool try_release( )

Description
Releases the reservation held at the sender.

Returns
The default implementation returns false.

6.2.7 bool try_consume( )

Description
Consumes the reservation held at the sender.

Returns
The default implementation returns false.

6.3 receiver Template Class

Summary
An abstract base class for nodes that act as message receivers.
Syntax
    template< typename T > class receiver;

Header
    #include "tbb/flow_graph.h"

Description
The receiver template class is an abstract base class that defines the interface for nodes that can act as receivers. Default implementations for several functions are provided.

Members
    namespace tbb {
    namespace flow {

    template< typename T >
    class receiver {
    public:
        typedef T input_type;
        typedef sender<T> predecessor_type;
        virtual ~receiver();
        virtual bool try_put( const input_type &v ) = 0;
        virtual bool register_predecessor( predecessor_type &p ) { return false; }
        virtual bool remove_predecessor( predecessor_type &p ) { return false; }
    };

    }
    }

6.3.1 ~receiver()

Description
The destructor.

6.3.2 bool register_predecessor( predecessor_type & p )

Description
Adds a predecessor to the node's set of predecessors.

Returns
True if the predecessor is added. False otherwise. The default implementation returns false.

6.3.3 bool remove_predecessor( predecessor_type & p )

Description
Removes a predecessor from the node's set of predecessors.

Returns
True if the predecessor is removed. False otherwise. The default implementation returns false.

6.3.4 bool try_put( const input_type &v ) = 0

Description
A pure virtual method that represents the interface for putting an item to a receiver.

6.4 continue_msg Class

Summary
An empty class that represents a continue message. This class is used to indicate that the sender has completed.

Syntax
    class continue_msg;

Header
    #include "tbb/flow_graph.h"

Members
    namespace tbb {
    namespace flow {

    class continue_msg {};

    }
    }

6.5 continue_receiver Class

Summary
An abstract base class for nodes that act as receivers of continue_msg objects. These nodes call a method execute when the number of try_put calls reaches a threshold that represents the number of known predecessors.
Syntax
    class continue_receiver;

Header
    #include "tbb/flow_graph.h"

Description
This type of node is triggered when its method try_put has been called a number of times that is equal to the number of known predecessors. When triggered, the node calls the method execute, then resets and will fire again when it receives the correct number of try_put calls. This node type is useful for dependency graphs, where each node must wait for its predecessors to complete before executing, but no explicit data is passed across the edge.

Members
    namespace tbb {
    namespace flow {

    class continue_receiver : public receiver< continue_msg > {
    public:
        typedef continue_msg input_type;
        typedef sender< input_type > predecessor_type;
        continue_receiver( int num_predecessors = 0 );
        continue_receiver( const continue_receiver& src );
        virtual ~continue_receiver();
        virtual bool try_put( const input_type &v );
        virtual bool register_predecessor( predecessor_type &p );
        virtual bool remove_predecessor( predecessor_type &p );

    protected:
        virtual void execute() = 0;
    };

    }
    }

6.5.1 continue_receiver( int num_predecessors = 0 )

Effect
Constructs a continue_receiver that is initialized to trigger after receiving num_predecessors calls to try_put.

6.5.2 continue_receiver( const continue_receiver& src )

Effect
Constructs a continue_receiver that has the same initial state that src had after its construction. It does not copy the current count of try_puts received, or the current known number of predecessors. The continue_receiver that is constructed will only have a non-zero threshold if src was constructed with a non-zero threshold.

6.5.3 ~continue_receiver( )

Effect
Destructor.

6.5.4 bool try_put( const input_type & )

Effect
Increments the count of try_put calls received. If the incremented count is equal to the number of known predecessors, a call is made to execute and the internal count of try_put calls is reset to zero.
This method performs as if the call to execute and the updates to the internal count occur atomically.

Returns
True.

6.5.5 bool register_predecessor( predecessor_type & r )

Effect
Increments the number of known predecessors.

Returns
True.

6.5.6 bool remove_predecessor( predecessor_type & r )

Effect
Decrements the number of known predecessors.

CAUTION: The method execute is not called if the count of try_put calls received becomes equal to the number of known predecessors as a result of this call. That is, a call to remove_predecessor will never call execute.

6.5.7 void execute() = 0

Description
A pure virtual method that is called when the number of try_put calls is equal to the number of known predecessors. Must be overridden by the child class.

CAUTION: This method should be very fast or else enqueue a task to offload its work, since this method is called while the sender is blocked on try_put.

6.6 graph_node Class

Summary
A base class for all graph nodes.

Syntax
    class graph_node;

Header
    #include "tbb/flow_graph.h"

Description
The class graph_node is a base class for all flow graph nodes. The virtual destructor allows flow graph nodes to be destroyed through pointers to graph_node. For example, a vector< graph_node * > could be used to hold the addresses of flow graph nodes that will later need to be destroyed.

Members
    namespace tbb {
    namespace flow {

    class graph_node {
    public:
        virtual ~graph_node() {}
    };

    }
    }

6.7 continue_node Template Class

Summary
A template class that is a graph_node, continue_receiver and a sender<Output>. It executes a specified body object when triggered and broadcasts the generated value to all of its successors.

Syntax
    template< typename Output > class continue_node;

Header
    #include "tbb/flow_graph.h"

Description
This type is used for nodes that wait for their predecessors to complete before executing, but no explicit data is passed across the incoming edges.
The output of the node can be a continue_msg or a value.

A continue_node maintains an internal threshold, T, and an internal counter, C. If a value for the number of predecessors is provided at construction, then T is set to the provided value and C=0. Otherwise, C=T=0.

At each call to method register_predecessor, the threshold T is incremented. At each call to method remove_predecessor, the threshold T is decremented. The functions make_edge and remove_edge appropriately call register_predecessor and remove_predecessor when edges are added to or removed from a continue_node.

At each call to method try_put, C is incremented. If after the increment, C>=T, then C is reset to 0 and a task is enqueued to broadcast the result of body() to all successors. The increment of C, enqueueing of the task, and the resetting of C are all done atomically with respect to the node. If after the increment, C<T, no further action is taken.

continue_node<Output> Body Concept

    Pseudo-Signature                                      Semantics
    B::B( const B& )                                      Copy constructor.
    B::~B()                                               Destructor.
    void [19] operator=( const B& )                       Assignment.
    Output B::operator()(const continue_msg &v) const     Perform operation and return value of type Output.

CAUTION: The body object passed to a continue_node is copied. Therefore updates to member variables will not affect the original object used to construct the node. If the state held within a body object must be inspected from outside of the node, the copy_body function described in 6.22 can be used to obtain an updated copy.

Output must be copy-constructible and assignable.
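The threshold and counter bookkeeping described above can be modeled in a few lines. The following is a single-threaded sketch in plain C++ under stated assumptions: CounterModel and record_put are hypothetical names, the model only reports whether a put would trigger the body, and it makes no attempt to reproduce the atomicity that the real node provides.

```cpp
#include <cassert>

// Hypothetical, single-threaded model of the threshold T and counter C.
// It is not the Intel TBB implementation.
class CounterModel {
    int threshold_;  // T
    int count_;      // C
public:
    explicit CounterModel(int number_of_predecessors = 0)
        : threshold_(number_of_predecessors), count_(0) {}

    void register_predecessor() { ++threshold_; }

    // Only adjusts T; it never triggers the body.
    void remove_predecessor() { --threshold_; }

    // Models one try_put call: increments C and returns true when C >= T,
    // in which case C is reset to 0 (the point at which the real node would
    // enqueue a task to run the body).
    bool record_put() {
        if (++count_ >= threshold_) {
            count_ = 0;
            return true;
        }
        return false;
    }
};
```

With two registered predecessors, the first call to record_put returns false and the second returns true, after which the counter starts again from zero.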
Members
    namespace tbb {
    namespace flow {

    template< typename Output >
    class continue_node :
        public graph_node, public continue_receiver,
        public sender<Output> {
    public:
        template< typename Body >
        continue_node( graph &g, Body body );
        template< typename Body >
        continue_node( graph &g, int number_of_predecessors, Body body );
        continue_node( const continue_node& src );

        // continue_receiver
        typedef continue_msg input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef Output output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

[19] The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored.

6.7.1 template< typename Body> continue_node(graph &g, Body body)

Effect
Constructs a continue_node that will invoke body.

6.7.2 template< typename Body> continue_node(graph &g, int number_of_predecessors, Body body)

Effect
Constructs a continue_node that will invoke body. The threshold T is initialized to number_of_predecessors.

6.7.3 continue_node( const continue_node & src )

Effect
Constructs a continue_node that has the same initial state that src had after its construction. It does not copy the current count of try_puts received, or the current known number of predecessors. The continue_node that is constructed will have a reference to the same graph object as src, have a copy of the initial body used by src, and only have a non-zero threshold if src was constructed with a non-zero threshold.

CAUTION: The new body object is copy constructed from a copy of the original body provided to src at its construction.
Therefore changes made to member variables in src's body after the construction of src will not affect the body of the new continue_node.

6.7.4 bool register_predecessor( predecessor_type & r )

Effect
Increments the number of known predecessors.

Returns
True.

6.7.5 bool remove_predecessor( predecessor_type & r )

Effect
Decrements the number of known predecessors.

CAUTION: The body is not called if the count of try_put calls received becomes equal to the number of known predecessors as a result of this call. That is, a call to remove_predecessor will never invoke the body.

6.7.6 bool try_put( const input_type & )

Effect
Increments the count of try_put calls received. If the incremented count is equal to the number of known predecessors, a task is enqueued to execute the body and the internal count of try_put calls is reset to zero. This method performs as if the enqueueing of the body task and the updates to the internal count occur atomically. It does not wait for the execution of the body to complete.

Returns
True.

6.7.7 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
True.

6.7.8 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
True.

6.7.9 bool try_get( output_type &v )

Description
The continue_node does not contain buffering. Therefore it always rejects try_get calls.

Returns
False.

6.7.10 bool try_reserve( output_type & )

Description
The continue_node does not contain buffering. Therefore it cannot be reserved.

Returns
False.

6.7.11 bool try_release( )

Description
The continue_node does not contain buffering. Therefore it cannot be reserved.

Returns
False.

6.7.12 bool try_consume( )

Description
The continue_node does not contain buffering. Therefore it cannot be reserved.

Returns
False.

6.8 function_node Template Class

Summary
A template class that is a graph_node, receiver<Input> and a sender<Output>.
This node may have concurrency limits as set by the user. By default, a function_node has an internal FIFO buffer at its input. Messages that cannot be immediately processed due to concurrency limits are temporarily stored in this FIFO buffer. A template argument can be used to disable this internal buffer. If the FIFO buffer is disabled, incoming messages are rejected if they cannot be processed immediately while respecting the concurrency limits of the node.

Syntax
    template < typename Input,
               typename Output = continue_msg,
               graph_buffer_policy = queueing,
               typename Allocator=cache_aligned_allocator<Input> >
    class function_node;

Header
    #include "tbb/flow_graph.h"

Description
A function_node receives messages of type Input at a single input port and generates a single output message of type Output that is broadcast to all successors. Rejection of messages by successors is handled using the protocol in Figure 4.

If graph_buffer_policy==queueing, an internal unbounded input buffer is maintained using memory obtained through an allocator of type Allocator.

A function_node maintains an internal constant threshold T and an internal counter C. At construction, C=0 and T is set to the value passed in to the constructor. The behavior of a call to try_put is determined by the value of T and C as shown in Table 23.

Table 23: Behavior of a call to a function_node's try_put

    Value of threshold T: T == flow::unlimited
    Value of counter C: NA
    bool try_put( input_type v ): A task is enqueued that broadcasts the result of body(v) to all successors. Returns true.

    Value of threshold T: T != flow::unlimited
    Value of counter C: C < T
    bool try_put( input_type v ): Increments C. A task is enqueued that broadcasts the result of body(v) to all successors and then decrements C. Returns true.

    Value of threshold T: T != flow::unlimited
    Value of counter C: C >= T
    bool try_put( input_type v ): If the template argument graph_buffer_policy==queueing, v is stored in an internal FIFO buffer until C < T. When C becomes less than T, C is incremented and a task is enqueued that broadcasts the result of body(v) to all successors and then decrements C. Returns true. If the template argument graph_buffer_policy==rejecting and C >= T, returns false.

A function_node has a user-settable concurrency limit. It can have flow::unlimited concurrency, which allows an unlimited number of invocations of the body to execute concurrently. It can have flow::serial concurrency, which allows only a single call of body to execute at a time. The user can also provide a value of type size_t to limit concurrency to a value between 1 and unlimited.

A function_node with graph_buffer_policy==rejecting will maintain a predecessor set as described in Figure 4. If the function_node transitions from a state where C >= T to a state where C < T, it will try to get new messages from its set of predecessors until C >= T or there are no valid predecessors left in the set.

NOTE: A function_node can serve as a terminal node in the graph. The convention is to use an Output of continue_msg and attach no successor.

The Body concept for function_node is shown in Table 24.

Table 24: function_node Body Concept

    Pseudo-Signature                                 Semantics
    B::B( const B& )                                 Copy constructor.
    B::~B()                                          Destructor.
    void [20] operator=( const B& )                  Assignment.
    Output B::operator()(const Input &v) const       Perform operation on v and return value of type Output.

CAUTION: The body object passed to a function_node is copied. Therefore updates to member variables will not affect the original object used to construct the node. If the state held within a body object must be inspected from outside of the node, the copy_body function described in 6.22 can be used to obtain an updated copy.

Input and Output must be copy-constructible and assignable.
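The decision logic of Table 23 can be sketched as a small single-threaded model in plain C++. Everything here is an assumption made for illustration: ConcurrencyModel, BufferPolicy and PutOutcome are hypothetical names, a threshold of 0 stands in for flow::unlimited, and the model only tracks the counter and the FIFO buffer rather than running any tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

enum class BufferPolicy { Queueing, Rejecting };
enum class PutOutcome { Started, Buffered, Rejected };

// Hypothetical model of the threshold T, counter C and input buffer.
// It is not the Intel TBB implementation.
class ConcurrencyModel {
    std::size_t threshold_;   // T; 0 stands in for flow::unlimited here
    std::size_t running_;     // C, the number of body invocations in flight
    BufferPolicy policy_;
    std::deque<int> buffer_;  // FIFO input buffer, used only when queueing
public:
    ConcurrencyModel(std::size_t threshold, BufferPolicy policy)
        : threshold_(threshold), running_(0), policy_(policy) {}

    // Models one row of Table 23 per call.
    PutOutcome try_put(int v) {
        if (threshold_ == 0) return PutOutcome::Started;  // unlimited
        if (running_ < threshold_) {                      // C < T
            ++running_;
            return PutOutcome::Started;
        }
        if (policy_ == BufferPolicy::Queueing) {          // C >= T, queueing
            buffer_.push_back(v);
            return PutOutcome::Buffered;
        }
        return PutOutcome::Rejected;                      // C >= T, rejecting
    }

    // Models the end of one body invocation: decrements C and, if an item
    // is waiting and C < T, starts it. Returns true if a buffered item
    // was started.
    bool body_finished() {
        if (threshold_ == 0 || running_ == 0) return false;
        --running_;
        if (!buffer_.empty() && running_ < threshold_) {
            buffer_.pop_front();
            ++running_;
            return true;
        }
        return false;
    }
};
```

With a threshold of 1 and the queueing policy, a second put made while the first body is still running is buffered and is started by the next call to body_finished. With the rejecting policy the same second put is rejected, which in the real node corresponds to try_put returning false.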
Members
    namespace tbb {
    namespace flow {

    enum graph_buffer_policy {
        rejecting, reserving, queueing, tag_matching };

    template < typename Input,
               typename Output = continue_msg,
               graph_buffer_policy = queueing,
               typename Allocator=cache_aligned_allocator<Input> >
    class function_node :
        public graph_node, public receiver<Input>,
        public sender<Output> {
    public:
        template< typename Body >
        function_node( graph &g, size_t concurrency, Body body );
        function_node( const function_node &src );

        // receiver
        typedef Input input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef Output output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

[20] The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored.

6.8.1 template< typename Body> function_node(graph &g, size_t concurrency, Body body)

Description
Constructs a function_node that will invoke a copy of body. At most concurrency calls to body may be made concurrently.

6.8.2 function_node( const function_node &src )

Effect
Constructs a function_node that has the same initial state that src had when it was constructed. The function_node that is constructed will have a reference to the same graph object as src, will have a copy of the initial body used by src, and have the same concurrency threshold as src. The predecessors and successors of src will not be copied.

CAUTION: The new body object is copy constructed from a copy of the original body provided to src at its construction.
Therefore changes made to member variables in src’s body after the construction of src will not affect the body of the new function_node.

6.8.3 bool register_predecessor( predecessor_type & p )

Effect
Adds p to the set of predecessors.
Returns true.

6.8.4 bool remove_predecessor( predecessor_type & p )

Effect
Removes p from the set of predecessors.
Returns true.

6.8.5 bool try_put( const input_type &v )

Effect
See Table 23 for a description of the behavior of try_put.
Returns true.

6.8.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.
Returns true.

6.8.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns true.

6.8.8 bool try_get( output_type &v )

Description
A function_node does not contain buffering of its output. Therefore it always rejects try_get calls.
Returns false.

6.8.9 bool try_reserve( output_type & )

Description
A function_node does not contain buffering of its output. Therefore it cannot be reserved.
Returns false.

6.8.10 bool try_release( )

Description
A function_node does not contain buffering of its output. Therefore it cannot be reserved.
Returns false.

6.8.11 bool try_consume( )

Description
A function_node does not contain buffering of its output. Therefore it cannot be reserved.
Returns false.

6.9 source_node Class

Summary
A template class that is both a graph_node and a sender. This node can have no predecessors. It executes a user-provided body function object to generate messages that are broadcast to all successors. It is a serial node and will never call its body concurrently. It is able to buffer a single item. If no successor accepts an item that it has generated, the message is buffered and will be provided to successors before a new item is generated.
Syntax

    template < typename Output > class source_node;

Header

    #include "tbb/flow_graph.h"

Description
This type of node generates messages of type Output by invoking the user-provided body and broadcasts the result to all of its successors. Output must be copy-constructible and assignable.

A source_node is a serial node. Calls to body will never be made concurrently.

A source_node will continue to invoke body and broadcast messages until the body returns false or it has no valid successors. A message may be generated and then rejected by all successors. In that case, the message is buffered and will be the next message sent once a successor is added to the node or try_get is called. Calls to try_get will return a buffered message if available or will invoke body to attempt to generate a new message. A call to body is made only when the internal buffer is empty.

Rejection of messages by successors is handled using the protocol in Figure 4.

Table 25: source_node Body Concept

    Pseudo-Signature                                 Semantics
    B::B( const B& )                                 Copy constructor.
    B::~B()                                          Destructor.
    void operator=( const B& )  (see footnote 21)    Assignment.
    bool B::operator()( Output &v )                  Returns true when it has assigned a new value to v. Returns false when no new values may be generated.

CAUTION: The body object passed to a source_node is copied. Therefore updates to member variables will not affect the original object used to construct the node.

Output must be copy-constructible and assignable.
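The Body concept in Table 25 can be sketched with a small generator. This is a standard-library-only illustration under stated assumptions: `CountTo` and `drain` are hypothetical names, and `drain` is a simplified serial driver that ignores successors and buffering, not the way a source_node schedules its work.

```cpp
#include <vector>

// Hypothetical body: yields 0, 1, ..., limit-1, then reports exhaustion.
struct CountTo {
    int next;
    int limit;
    bool operator()(int &v) {
        if (next >= limit) return false;  // no new values may be generated
        v = next++;                       // a new value has been assigned to v
        return true;
    }
};

// Hypothetical serial driver: takes its own copy of the body (pass by value)
// and invokes it, one call at a time, until it returns false.
template <typename Body>
std::vector<int> drain(Body body) {
    std::vector<int> out;
    int v = 0;
    while (body(v)) out.push_back(v);
    return out;
}
```

Because `drain` receives the body by value, the caller's object is left untouched, mirroring the CAUTION about the body being copied.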
Members

    namespace tbb {
    namespace flow {

    template < typename Output >
    class source_node : public graph_node, public sender< Output > {
    public:
        typedef Output output_type;
        typedef receiver< output_type > successor_type;

        template< typename Body >
        source_node( graph &g, Body body, bool is_active = true );
        source_node( const source_node &src );
        ~source_node();

        void activate();
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type &v );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

21 The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored.

6.9.1 template< typename Body> source_node(graph &g, Body body, bool is_active=true)

Description
Constructs a source_node that will invoke body. By default the node is created in the active state, that is, it will begin generating messages immediately. If is_active is false, messages will not be generated until a call to activate is made.

6.9.2 source_node( const source_node &src )

Description
Constructs a source_node that has the same initial state that src had when it was constructed. The source_node that is constructed will have a reference to the same graph object as src, will have a copy of the initial body used by src, and have the same initial active state as src. The predecessors and successors of src will not be copied.

CAUTION: The new body object is copy constructed from a copy of the original body provided to src at its construction. Therefore changes made to member variables in src’s body after the construction of src will not affect the body of the new source_node.

6.9.3 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.
Returns true.

6.9.4 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns true.

6.9.5 bool try_get( output_type &v )

Description
Will copy the buffered message into v if available or will invoke body to attempt to generate a new message that will be copied into v.
Returns true if a message is copied to v, false otherwise.

6.9.6 bool try_reserve( output_type &v )

Description
Reserves the source_node if possible. If a message can be buffered and the node is not already reserved, the node is reserved for the caller and the value is copied into v.
Returns true if the node is reserved for the caller, false otherwise.

6.9.7 bool try_release( )

Description
Releases any reservation held on the source_node. The message held in the internal buffer is retained.
Returns true.

6.9.8 bool try_consume( )

Description
Releases any reservation held on the source_node and clears the internal buffer.
Returns true.

6.10 overwrite_node Template Class

Summary
A template class that is a graph_node, receiver and sender. An overwrite_node is a buffer of a single item that can be over-written. The value held in the buffer is initially invalid. Gets from the node are non-destructive.

Syntax

    template < typename T > class overwrite_node;

Header

    #include "tbb/flow_graph.h"

Description
This type of node buffers a single item of type T. The value is initially invalid. A try_put will set the value of the internal buffer, and broadcast the new value to all successors. If the internal value is valid, a try_get will return true and copy the buffer value to the output. If the internal value is invalid, try_get will return false.

Rejection of messages by successors is handled using the protocol in Figure 4.
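The buffer semantics described for overwrite_node can be modeled in a few lines. This is a minimal sketch using only the standard language, with the hypothetical class name `OverwriteModel`; it deliberately omits the broadcast to successors and any thread safety, and it assumes T is default-constructible, which the real node is not stated to require.

```cpp
// Hypothetical single-item buffer with overwrite semantics.
template <typename T>
class OverwriteModel {
    T value_{};
    bool valid_ = false;  // the value is initially invalid
public:
    // Every put replaces the stored value and is accepted.
    bool try_put(const T &v) { value_ = v; valid_ = true; return true; }

    // Non-destructive get: the value stays in the buffer.
    bool try_get(T &v) const {
        if (!valid_) return false;
        v = value_;
        return true;
    }

    bool is_valid() const { return valid_; }
    void clear() { valid_ = false; }
};
```

Two consecutive calls to `try_get` return the same value, and only a later `try_put` or `clear` changes what is observed.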
T must be copy-constructible and assignable.

Members

    namespace tbb {
    namespace flow {

    template< typename T >
    class overwrite_node : public graph_node, public receiver<T>, public sender<T> {
    public:
        overwrite_node();
        overwrite_node( const overwrite_node &src );
        ~overwrite_node();

        // receiver<T>
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<T>
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );

        bool is_valid();
        void clear();
    };

    }
    }

6.10.1 overwrite_node()

Effect
Constructs an object of type overwrite_node with an invalid internal buffer item.

6.10.2 overwrite_node( const overwrite_node &src )

Effect
Constructs an object of type overwrite_node with an invalid internal buffer item. The buffered value and list of successors are NOT copied from src.

6.10.3 ~overwrite_node()

Effect
Destroys the overwrite_node.

6.10.4 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.10.5 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.10.6 bool try_put( const input_type &v )

Effect
Stores v in the internal single item buffer. Calls try_put( v ) on all successors.
Returns true.

6.10.7 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors. If a valid item v is held in the buffer, a task is enqueued to call r.try_put(v).
Returns true.
6.10.8 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns true.

6.10.9 bool try_get( output_type &v )

Description
If the internal buffer is valid, assigns the value to v.
Returns true if v is assigned to, false if v is not assigned to.

6.10.10 bool try_reserve( output_type & )

Description
Does not support reservations.
Returns false.

6.10.11 bool try_release( )

Description
Does not support reservations.
Returns false.

6.10.12 bool try_consume( )

Description
Does not support reservations.
Returns false.

6.10.13 bool is_valid()

Returns
Returns true if the buffer holds a valid value, otherwise returns false.

6.10.14 void clear()

Effect
Invalidates the value held in the buffer.

6.11 write_once_node Template Class

Summary
A template class that is a graph_node, receiver and sender. A write_once_node represents a buffer of a single item that cannot be over-written. The first put to the node sets the value. The value may be cleared explicitly, after which a new value may be set. Gets from the node are non-destructive.

Rejection of messages by successors is handled using the protocol in Figure 4.
T must be copy-constructible and assignable.

Syntax

    template < typename T > class write_once_node;

Header

    #include "tbb/flow_graph.h"

Members

    namespace tbb {
    namespace flow {

    template< typename T >
    class write_once_node : public graph_node, public receiver<T>, public sender<T> {
    public:
        write_once_node();
        write_once_node( const write_once_node &src );

        // receiver<T>
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<T>
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );

        bool is_valid();
        void clear();
    };

    }
    }

6.11.1 write_once_node()

Effect
Constructs an object of type write_once_node with an invalid internal buffer item.

6.11.2 write_once_node( const write_once_node &src )

Effect
Constructs an object of type write_once_node with an invalid internal buffer item. The buffered value and list of successors are NOT copied from src.

6.11.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.11.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.11.5 bool try_put( const input_type &v )

Effect
Stores v in the internal single item buffer if it does not already contain a valid value. If a new value is set, it calls try_put( v ) on all successors.
Returns true.

6.11.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors. If a valid item v is held in the buffer, a task is enqueued to call r.try_put(v).
Returns true.
6.11.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns true.

6.11.8 bool try_get( output_type &v )

Description
If the internal buffer is valid, assigns the value to v.
Returns true if v is assigned to, false if v is not assigned to.

6.11.9 bool try_reserve( output_type & )

Description
Does not support reservations.
Returns false.

6.11.10 bool try_release( )

Description
Does not support reservations.
Returns false.

6.11.11 bool try_consume( )

Description
Does not support reservations.
Returns false.

6.11.12 bool is_valid()

Returns
Returns true if the buffer holds a valid value, otherwise returns false.

6.11.13 void clear()

Effect
Invalidates the value held in the buffer.

6.12 broadcast_node Template Class

Summary
A node that broadcasts incoming messages to all of its successors.

Syntax

    template < typename T > class broadcast_node;

Header

    #include "tbb/flow_graph.h"

Description
A broadcast_node is a graph_node, receiver and sender that broadcasts incoming messages of type T to all of its successors. There is no buffering in the node, so all messages are forwarded immediately to all successors.

Rejection of messages by successors is handled using the protocol in Figure 4.
T must be copy-constructible and assignable.

Members

    namespace tbb {
    namespace flow {

    template< typename T >
    class broadcast_node : public graph_node, public receiver<T>, public sender<T> {
    public:
        broadcast_node();
        broadcast_node( const broadcast_node &src );

        // receiver<T>
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<T>
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.12.1 broadcast_node()

Effect
Constructs an object of type broadcast_node.

6.12.2 broadcast_node( const broadcast_node &src )

Effect
Constructs an object of type broadcast_node. The list of successors is NOT copied from src.

6.12.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.12.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.12.5 bool try_put( const input_type &v )

Effect
Broadcasts v to all successors.

Returns
Always returns true, even if it was unable to successfully forward the message to any of its successors.

6.12.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.
Returns true.

6.12.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns true.

6.12.8 bool try_get( output_type & )

Returns false.

6.12.9 bool try_reserve( output_type & )

Returns false.

6.12.10 bool try_release( )

Returns false.

6.12.11 bool try_consume( )

Returns false.
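The try_put and try_get behavior documented for broadcast_node can be mimicked with a standard-library-only model. `BroadcastModel` is a hypothetical name, successors are represented as plain callables returning bool, and the sketch leaves out the Figure 4 handling of rejecting successors as well as any concurrency.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical unbuffered broadcaster.
template <typename T>
class BroadcastModel {
    using Successor = std::function<bool(const T &)>;
    std::vector<Successor> successors_;
public:
    void register_successor(Successor s) { successors_.push_back(std::move(s)); }

    // Offers v to every successor; returns true whatever they answer.
    bool try_put(const T &v) {
        for (auto &s : successors_) s(v);  // result intentionally ignored
        return true;
    }

    // Nothing is buffered, so there is never anything to get.
    bool try_get(T &) { return false; }
};
```

A caller therefore cannot use the return value of `try_put` to learn whether any successor accepted the message.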
6.13 buffer_node Class

Summary
An unbounded buffer of messages of type T. Messages are forwarded in arbitrary order.

Syntax

    template< typename T, typename A=cache_aligned_allocator<T> > class buffer_node;

Header

    #include "tbb/flow_graph.h"

Description
A buffer_node is a graph_node, receiver and sender that forwards messages in arbitrary order to a single successor in its successor set. Successors are tried in the order that they were registered with the node. If a successor rejects the message, it is removed from the successor list according to the policy in Figure 4 and the next successor in the set is tried. This continues until a successor accepts the message, or all successors have been attempted. Items that are successfully transferred to a successor are removed from the buffer.

A buffer_node is reservable and supports a single reservation at a time. While an item is reserved, other items may still be forwarded to successors and try_get calls will return other non-reserved items if available. While an item is reserved, try_put will still return true and add items to the buffer.

An allocator of type A is used to allocate internal memory for the buffer_node.

T must be copy-constructible and assignable.

Rejection of messages by successors is handled using the protocol in Figure 4.
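The forwarding rule in the Description (try successors in registration order, drop those that reject, stop at the first that accepts) can be sketched as follows. This is a simplified standard-library-only model: `ForwardModel`, `forward` and `successor_count` are hypothetical names, successors are plain callables, and the full Figure 4 protocol, under which a rejecting successor may later come back, is not modeled.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical forwarder that delivers each item to a single successor.
template <typename T>
class ForwardModel {
    using Successor = std::function<bool(const T &)>;
    std::vector<Successor> successors_;  // kept in registration order
public:
    void register_successor(Successor s) { successors_.push_back(std::move(s)); }
    std::size_t successor_count() const { return successors_.size(); }

    // Returns true once some successor accepts v; false if all rejected.
    bool forward(const T &v) {
        while (!successors_.empty()) {
            if (successors_.front()(v)) return true;  // accepted: stop here
            successors_.erase(successors_.begin());   // rejected: drop it
        }
        return false;  // in the real node the item would stay buffered
    }
};
```

Unlike a broadcast, exactly one successor receives each item in this model, and rejecting successors are no longer tried for later items.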
Members

    namespace tbb {
    namespace flow {

    template< typename T, typename A=cache_aligned_allocator<T> >
    class buffer_node : public graph_node, public receiver<T>, public sender<T> {
    public:
        buffer_node( graph &g );
        buffer_node( const buffer_node &src );

        // receiver<T>
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<T>
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.13.1 buffer_node( graph& g )

Effect
Constructs an empty buffer_node that belongs to graph g.

6.13.2 buffer_node( const buffer_node &src )

Effect
Constructs an empty buffer_node that belongs to the same graph g as src. The list of predecessors, the list of successors and the messages in the buffer are NOT copied.

6.13.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.13.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.13.5 bool try_put( const input_type &v )

Effect
Adds v to the buffer. If v is the only item in the buffer, a task is also enqueued to forward the item to a successor.
Returns true.

6.13.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.
Returns true.

6.13.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns true.

6.13.8 bool try_get( output_type & v )

Returns
Returns true if an item can be removed from the buffer and assigned to v.
Returns false if there is no non-reserved item currently in the buffer.

6.13.9 bool try_reserve( output_type & v )

Effect
Assigns a newly reserved item to v if there is no reservation currently held and there is at least one item available in the buffer. If a new reservation is made, the buffer is marked as reserved.

Returns
Returns true if v has been assigned a newly reserved item. Returns false otherwise.

6.13.10 bool try_release( )

Effect
Releases the reservation on the buffer. The item that was returned in the last successful call to try_reserve remains in the buffer.

Returns
Returns true if the buffer is currently reserved and false otherwise.

6.13.11 bool try_consume( )

Effect
Releases the reservation on the buffer. The item that was returned in the last successful call to try_reserve is removed from the buffer.

Returns
Returns true if the buffer is currently reserved and false otherwise.

6.14 queue_node Template Class

Summary
An unbounded buffer of messages of type T. Messages are forwarded in first-in first-out (FIFO) order.

Syntax

    template< typename T, typename A=cache_aligned_allocator<T> > class queue_node;

Header

    #include "tbb/flow_graph.h"

Description
A queue_node is a graph_node, receiver and sender that forwards messages in first-in first-out (FIFO) order to a single successor in its successor set. Successors are tried in the order that they were registered with the node. If a successor rejects the message, it is removed from the successor list as described by the policy in Figure 4 and the next successor in the set is tried. This continues until a successor accepts the message, or all successors have been attempted. Items that are successfully transferred to a successor are removed from the buffer.

A queue_node is reservable and supports a single reservation at a time. While the queue_node is reserved, no other items will be forwarded to successors and all try_get calls will return false.
While reserved, try_put will still return true and add items to the queue_node.

An allocator of type A is used to allocate internal memory for the queue_node.

T must be copy-constructible and assignable.

Rejection of messages by successors is handled using the protocol in Figure 4.

Members

    namespace tbb {
    namespace flow {

    template< typename T, typename A=cache_aligned_allocator<T> >
    class queue_node : public buffer_node<T,A> {
    public:
        queue_node( graph &g );
        queue_node( const queue_node &src );

        // receiver<T>
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( input_type v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<T>
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.14.1 queue_node( graph& g )

Effect
Constructs an empty queue_node that belongs to graph g.

6.14.2 queue_node( const queue_node &src )

Effect
Constructs an empty queue_node that belongs to the same graph g as src. The list of predecessors, the list of successors and the messages in the buffer are NOT copied.

6.14.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.14.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.14.5 bool try_put( const input_type &v )

Effect
Adds v to the queue_node. If v is the only item in the queue_node, a task is enqueued to forward the item to a successor.
Returns true.

6.14.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.
Returns true.

6.14.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.
Returns true.

6.14.8 bool try_get( output_type & v )

Returns
Returns true if an item can be removed from the front of the queue_node and assigned to v. Returns false if there is no item currently in the queue_node or if the node is reserved.

6.14.9 bool try_reserve( output_type & v )

Effect
If the call returns true, the node is reserved and will forward no more messages until the reservation has been released or consumed.

Returns
Returns true if there is an item in the queue_node and the node is not currently reserved. If an item can be returned, it is assigned to v. Returns false if there is no item currently in the queue_node or if the node is reserved.

6.14.10 bool try_release( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve remains in the queue_node.

Returns
Returns true if the node is currently reserved and false otherwise.

6.14.11 bool try_consume( )

Effect
Releases the reservation on the queue_node. The item that was returned in the last successful call to try_reserve is popped from the front of the queue.

Returns
Returns true if the queue_node is currently reserved and false otherwise.

6.15 priority_queue_node Template Class

Summary
An unbounded buffer of messages of type T. Messages are forwarded in priority order.

Syntax

    template< typename T, typename Compare = std::less<T>, typename A=cache_aligned_allocator<T> > class priority_queue_node;

Header

    #include "tbb/flow_graph.h"

Description
A priority_queue_node is a graph_node, receiver and sender that forwards messages in priority order to a single successor in its successor set. Successors are tried in the order that they were registered with the node. If a successor rejects the message, it is removed from the successor list as described by the policy in Figure 4 and the next successor in the set is tried.
This continues until a successor accepts the message, or all successors have been attempted. Items that are successfully transferred to a successor are removed from the buffer.

The next message to be forwarded has the largest priority as determined by Compare.

A priority_queue_node is reservable and supports a single reservation at a time. While the priority_queue_node is reserved, no other items will be forwarded to successors and all try_get calls will return false. While reserved, try_put will still return true and add items to the priority_queue_node.

An allocator of type A is used to allocate internal memory for the priority_queue_node.

T must be copy-constructible and assignable.

Rejection of messages by successors is handled using the protocol in Figure 4.

Members

    namespace tbb {
    namespace flow {

    template< typename T, typename Compare = std::less<T>,
              typename A=cache_aligned_allocator<T> >
    class priority_queue_node : public queue_node<T> {
    public:
        typedef size_t size_type;
        priority_queue_node( graph &g );
        priority_queue_node( const priority_queue_node &src );
        ~priority_queue_node();

        // receiver<T>
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( input_type v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<T>
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.15.1 priority_queue_node( graph& g)

Effect
Constructs an empty priority_queue_node that belongs to graph g.

6.15.2 priority_queue_node( const priority_queue_node &src )

Effect
Constructs an empty priority_queue_node that belongs to the same graph g as src. The list of predecessors, the list of successors and the messages in the buffer are NOT copied.
6.15.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.15.4 bool remove_predecessor( predecessor_type &)

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.

6.15.5 bool try_put( const input_type &v )

Effect
Adds v to the priority_queue_node. If v's priority is the largest of all of the currently buffered messages, a task is enqueued to forward the item to a successor.
Returns true.

6.15.6 bool register_successor( successor_type &r )

Effect
Adds r to the set of successors.
Returns true.

6.15.7 bool remove_successor( successor_type &r )

Effect
Removes r from the set of successors.
Returns true.

6.15.8 bool try_get( output_type & v )

Returns
Returns true if a message is available in the node and the node is not currently reserved. Otherwise returns false. If the node returns true, the message with the largest priority will have been copied to v.

6.15.9 bool try_reserve( output_type & v )

Effect
If the call returns true, the node is reserved and will forward no more messages until the reservation has been released or consumed.

Returns
Returns true if a message is available in the node and the node is not currently reserved. Otherwise returns false. If the node returns true, the message with the largest priority will have been copied to v.

6.15.10 bool try_release( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve remains in the priority_queue_node.

Returns
Returns true if the buffer is currently reserved and false otherwise.

6.15.11 bool try_consume( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve is removed from the priority_queue_node.

Returns
Returns true if the buffer is currently reserved and false otherwise.
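The reserve, release and consume calls described in 6.15.8 through 6.15.11 form a small protocol that is easier to follow in code. The sketch below is a single-threaded, standard-library-only model; `PriorityModel` is a hypothetical name, forwarding to successors is omitted, T is assumed to be default-constructible, and holding the reserved item outside the heap is a design choice of this sketch, not something the manual specifies.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical priority buffer with a single reservation at a time.
template <typename T, typename Compare = std::less<T>>
class PriorityModel {
    std::vector<T> heap_;    // heap ordered by Compare (largest first)
    Compare cmp_;
    bool reserved_ = false;
    T reserved_item_{};      // item held aside while reserved
public:
    bool try_put(const T &v) {  // accepted even while reserved
        heap_.push_back(v);
        std::push_heap(heap_.begin(), heap_.end(), cmp_);
        return true;
    }
    bool try_get(T &v) {
        if (reserved_ || heap_.empty()) return false;
        std::pop_heap(heap_.begin(), heap_.end(), cmp_);
        v = heap_.back();
        heap_.pop_back();
        return true;
    }
    bool try_reserve(T &v) {
        if (!try_get(reserved_item_)) return false;
        v = reserved_item_;
        reserved_ = true;
        return true;
    }
    bool try_release() {  // put the reserved item back
        if (!reserved_) return false;
        reserved_ = false;
        return try_put(reserved_item_);
    }
    bool try_consume() {  // discard the reserved item
        if (!reserved_) return false;
        reserved_ = false;
        return true;
    }
};
```

Setting the reserved item aside means an item of larger priority that arrives during the reservation cannot be consumed by mistake; after a release, the returned item simply competes again by priority.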
6.16 sequencer_node Template Class

Summary
An unbounded buffer of messages of type T. Messages are forwarded in sequence order.

Syntax

    template< typename T, typename A=cache_aligned_allocator<T> > class sequencer_node;

Header

    #include "tbb/flow_graph.h"

Description
A sequencer_node is a graph_node, receiver and sender that forwards messages in sequence order to a single successor in its successor set. Successors are tried in the order that they were registered with the node. If a successor rejects the message, it is removed from the successor list as described by the policy in Figure 4 and the next successor in the set is tried. This continues until a successor accepts the message, or all successors have been attempted. Items that are successfully transferred to a successor are removed from the buffer.

Each item that passes through a sequencer_node is ordered by its sequence order number. These sequence order numbers range from 0 … N, where N is the largest integer representable by the size_t type. An item’s sequence order number is determined by passing the item to a user-provided function object that models the Sequencer Concept shown in Table 26.

Table 26: sequencer_node Sequencer Concept

    Pseudo-Signature                                 Semantics
    S::S( const S& )                                 Copy constructor.
    S::~S()                                          Destructor.
    void operator=( const S& )  (see footnote 22)    Assignment.
    size_t S::operator()( const T &v )               Returns the sequence number for the provided message v.

22 The return type void in the pseudo-signature denotes that operator= is not required to return a value. The actual operator= can return a value, which will be ignored.

A sequencer_node is reservable and supports a single reservation at a time. While a sequencer_node is reserved, no other items will be forwarded to successors and all try_get calls will return false. While reserved, try_put will still return true and add items to the sequencer_node.
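The role of the Sequencer function object can be shown with a small single-threaded model. `Msg`, `MsgSequencer` and `SequencerModel` are hypothetical names introduced for this sketch; it uses only the standard library, omits reservations and successor forwarding, and assumes each sequence number is put at most once.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical message type that carries its own sequence number.
struct Msg {
    std::size_t seq;
    std::string text;
};

// Models the Sequencer concept: maps a message to its sequence number.
struct MsgSequencer {
    std::size_t operator()(const Msg &m) const { return m.seq; }
};

// Hypothetical buffer that releases items strictly in order 0, 1, 2, ...
template <typename T, typename Sequencer>
class SequencerModel {
    std::map<std::size_t, T> pending_;  // items waiting for their turn
    std::size_t next_ = 0;              // next sequence number to release
    Sequencer seq_;                     // private copy of the sequencer
public:
    explicit SequencerModel(const Sequencer &s) : seq_(s) {}

    bool try_put(const T &v) {
        pending_.emplace(seq_(v), v);
        return true;
    }

    // Succeeds only when the item numbered next_ has already arrived.
    bool try_get(T &v) {
        auto it = pending_.find(next_);
        if (it == pending_.end()) return false;
        v = it->second;
        pending_.erase(it);
        ++next_;
        return true;
    }
};
```

An item that arrives early is held back until every lower-numbered item has been released, which is the ordering guarantee the Description gives.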
An allocator of type A is used to allocate internal memory for the sequencer_node.

T must be copy-constructible and assignable.

Rejection of messages by successors is handled using the protocol in Figure 4.

Members

    namespace tbb {
    namespace flow {

    template< typename T, typename A=cache_aligned_allocator<T> >
    class sequencer_node : public queue_node<T> {
    public:
        template< typename Sequencer >
        sequencer_node( graph &g, const Sequencer& s );
        sequencer_node( const sequencer_node &src );

        // receiver<T>
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( input_type v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender<T>
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.16.1 template< typename Sequencer > sequencer_node( graph& g, const Sequencer& s )

Effect
Constructs an empty sequencer_node that belongs to graph g and uses s to compute sequence numbers for items.

6.16.2 sequencer_node( const sequencer_node &src )

Effect
Constructs an empty sequencer_node that belongs to the same graph g as src and will use a copy of the Sequencer s used to construct src. The list of predecessors, the list of successors and the messages in the buffer are NOT copied.

CAUTION: The new Sequencer object is copy constructed from a copy of the original Sequencer object provided to src at its construction. Therefore changes made to member variables in src’s object will not affect the Sequencer of the new sequencer_node.

6.16.3 bool register_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors.
Returns false.
6.16.4 bool remove_predecessor( predecessor_type & )

Description
Never rejects puts and therefore does not need to maintain a list of predecessors. Returns false.

6.16.5 bool try_put( input_type v )

Effect
Adds v to the sequencer_node. If v's sequence number is the next item in the sequence, a task is enqueued to forward the item to a successor. Returns true.

6.16.6 bool register_successor( successor_type &r )

Effect
Adds r to the set of successors. Returns true.

6.16.7 bool remove_successor( successor_type &r )

Effect
Removes r from the set of successors. Returns true.

6.16.8 bool try_get( output_type & v )

Returns
Returns true if the next item in the sequence is available in the sequencer_node. If so, it is removed from the node and assigned to v. Returns false if the next item in sequence order is not available or if the node is reserved.

6.16.9 bool try_reserve( output_type & v )

Effect
If the call returns true, the node is reserved and will forward no more messages until the reservation has been released or consumed.

Returns
Returns true if the next item in sequence order is available in the sequencer_node. If so, the item is assigned to v, but is not removed from the sequencer_node. Returns false if the next item in sequence order is not available or if the node is reserved.

6.16.10 bool try_release( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve remains in the sequencer_node.

Returns
Returns true if the buffer is currently reserved and false otherwise.

6.16.11 bool try_consume( )

Effect
Releases the reservation on the node. The item that was returned in the last successful call to try_reserve is removed from the sequencer_node.

Returns
Returns true if the buffer is currently reserved and false otherwise.

6.17 limiter_node Template Class

Summary
A node that counts and limits the number of messages that pass through it.
Syntax
    template < typename T > class limiter_node;

Header
    #include "tbb/flow_graph.h"

Description
A limiter_node is a graph_node, receiver<T> and sender<T> that broadcasts messages to all of its successors. It keeps a counter C of the number of broadcasts it makes and does not accept new messages once its user-specified threshold T is reached. The internal count of broadcasts C can be decremented through use of its embedded continue_receiver decrement.

The behavior of a call to a limiter_node's try_put is shown in Table 27.

Table 27: Behavior of a call to a limiter_node's try_put

    Value of counter C    bool try_put( input_type v )
    C < T                 C is incremented and v is broadcast to all successors. If no
                          successor accepts the message, C is decremented. Returns true
                          if the message was successfully broadcast to at least one
                          successor and false otherwise.
    C == T                Returns false.

When try_put is called on the member object decrement, the limiter_node will try to get a message from one of its known predecessors and forward that message to all of its successors. If it cannot obtain a message from a predecessor, it will decrement C. Rejection of messages by successors and failed gets from predecessors are handled using the protocol in Figure 4.

T must be copy-constructible and assignable.
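The counter rules of Table 27 can be modeled in a few lines of standard C++. This is a sketch of the bookkeeping only, not the TBB class: limiter_model is a hypothetical name, the successor_accepts flag stands in for "at least one successor accepted the broadcast", and the guard against decrementing below zero is an assumption of this sketch rather than something the text states.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the counter C and threshold T from Table 27; not the TBB class.
class limiter_model {
public:
    explicit limiter_model(std::size_t threshold) : threshold_(threshold), count_(0) {}

    // successor_accepts simulates whether at least one successor took the message.
    bool try_put(bool successor_accepts) {
        if (count_ == threshold_) return false;  // C == T: reject the put
        ++count_;                                // C < T: count the broadcast
        if (!successor_accepts) {                // nobody accepted: undo the increment
            --count_;
            return false;
        }
        return true;
    }

    // Models a put to the decrement port when no predecessor has a message to pull.
    void decrement() {
        if (count_ > 0) --count_;  // assumption: never go below zero
    }

    std::size_t count() const { return count_; }

private:
    std::size_t threshold_;  // T in Table 27
    std::size_t count_;      // C in Table 27
};
```

With a threshold of 2, two accepted puts succeed, a third is rejected, and a single decrement() makes room for one more message.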
Members
    namespace tbb {
    namespace flow {

    template< typename T >
    class limiter_node : public graph_node, public receiver<T>,
                         public sender<T> {
    public:
        limiter_node( graph &g, size_t threshold,
                      int number_of_decrement_predecessors = 0 );
        limiter_node( const limiter_node &src );

        // a continue_receiver
        implementation-dependent-type decrement;

        // receiver
        typedef T input_type;
        typedef sender<input_type> predecessor_type;
        bool try_put( const input_type &v );
        bool register_predecessor( predecessor_type &p );
        bool remove_predecessor( predecessor_type &p );

        // sender
        typedef T output_type;
        typedef receiver<output_type> successor_type;
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    }
    }

6.17.1 limiter_node( graph &g, size_t threshold, int number_of_decrement_predecessors )

Description
Constructs a limiter_node that allows up to threshold items to pass through before rejecting try_puts. Optionally a number_of_decrement_predecessors value can be supplied. This value is passed on to the continue_receiver decrement's constructor.

6.17.2 limiter_node( const limiter_node &src )

Description
Constructs a limiter_node that has the same initial state that src had at its construction. The new limiter_node will belong to the same graph g as src, have the same threshold, and have the same initial number_of_decrement_predecessors. The list of predecessors, the list of successors and the current count of broadcasts, C, are NOT copied from src.

6.17.3 bool register_predecessor( predecessor_type& p )

Description
Adds a predecessor that can be pulled from once the broadcast count falls below the threshold.

Effect
Adds p to the set of predecessors. Returns true.

6.17.4 bool remove_predecessor( predecessor_type & r )

Effect
Removes r from the set of predecessors. Returns true.
6.17.5 bool try_put( const input_type &v )

Effect
If the broadcast count is below the threshold, v is broadcast to all successors. For each successor s, if s.try_put( v ) == false && s.register_predecessor( *this ) == true, then s is removed from the set of successors. Otherwise, s will remain in the set of successors.

Returns
true if v is broadcast. false if v is not broadcast because the threshold has been reached.

6.17.6 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors. Returns true.

6.17.7 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors. Returns true.

6.17.8 bool try_get( output_type & )

Description
Does not contain buffering and therefore cannot be pulled from. Returns false.

6.17.9 bool try_reserve( output_type & )

Description
Does not support reservations. Returns false.

6.17.10 bool try_release( )

Description
Does not support reservations. Returns false.

6.17.11 bool try_consume( )

Description
Does not support reservations. Returns false.

6.18 join_node Template Class

Summary
A node that creates a tuple from a set of messages received at its input ports and broadcasts the tuple to all of its successors. The class join_node supports three buffering policies at its input ports: reserving, queueing and tag_matching. By default, join_node input ports use the queueing policy.

Syntax
    template<typename OutputTuple, graph_buffer_policy JP = queueing>
    class join_node;

Header
    #include "tbb/flow_graph.h"

Description
A join_node is a graph_node and a sender< std::tuple< T0, T1, … > >. It contains a tuple of input ports, one receiver<Ti> for each of the types T0 … TN in OutputTuple. It supports multiple input receivers with distinct types and broadcasts a tuple of received messages to all of its successors. All input ports of a join_node must use the same buffering policy. The behavior of a join_node based on its buffering policy is shown in Table 28.
Table 28: Behavior of a join_node based on the buffering policy of its input ports.

    Buffering Policy    Behavior

    queueing            As each input port is put to, the incoming message is added to
                        an unbounded first-in first-out queue in the port. When there is
                        at least one message at each input port, the join_node broadcasts
                        a tuple containing the head of each queue to all successors. If
                        at least one successor accepts the tuple, the head of each input
                        port's queue is removed; otherwise, the messages remain in their
                        respective input port queues.

    reserving           As each input port is put to, the join_node marks that an input
                        may be available at that port and returns false. When all ports
                        have been marked as possibly available, the join_node will try to
                        reserve a message at each port from their known predecessors. If
                        it is unable to reserve a message at a port, it un-marks that
                        port, and releases all previously acquired reservations. If it is
                        able to reserve a message at all ports, it broadcasts a tuple
                        containing these messages to all successors. If at least one
                        successor accepts the tuple, the reservations are consumed;
                        otherwise, they are released.

    tag_matching        As each input port is put to, a user-provided function object is
                        applied to the message to obtain its tag. The message is then
                        added to a hash table at the input port, using the tag as the
                        key. When there is a message at each input port for a given tag,
                        the join_node broadcasts a tuple containing the matching messages
                        to all successors. If at least one successor accepts the tuple,
                        the messages are removed from each input port's hash table;
                        otherwise, the messages remain in their respective input ports.

Rejection of messages by successors of the join_node and failed gets from predecessors of the input ports are handled using the protocol in Figure 4.

The function template input_port described in 6.19 simplifies the syntax for getting a reference to a specific input port.
OutputTuple must be a std::tuple<T0,T1,…> where each element is copy-constructible and assignable.

Example
    #include <cstdio>
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    int main() {
        graph g;
        // f1 doubles an int, f2 halves a float
        function_node<int,int>
            f1( g, unlimited, [](const int &i) { return 2*i; } );
        function_node<float,float>
            f2( g, unlimited, [](const float &f) { return f/2; } );

        // j pairs one output of f1 with one output of f2
        join_node< std::tuple<int,float> > j(g);

        // f3 adds the two tuple elements and prints the sum
        function_node< std::tuple<int,float> >
            f3( g, unlimited,
                []( const std::tuple<int,float> &t ) {
                    printf("Result is %f\n",
                           std::get<0>(t) + std::get<1>(t));});

        make_edge( f1, input_port<0>(j) );
        make_edge( f2, input_port<1>(j) );
        make_edge( j, f3 );

        f1.try_put( 3 );
        f2.try_put( 3 );
        g.wait_for_all();
        return 0;
    }

In the example above, three function_node objects are created: f1 multiplies an int i by 2, f2 divides a float f by 2, and f3 receives a std::tuple<int,float> t, adds its elements together and prints the result. The join_node j combines the output of f1 and f2 and forwards the resulting tuple to f3. This example is purely a syntactic demonstration since there is very little work in the nodes.
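The tag_matching policy is the least obvious of the three, so a library-free sketch may help. The code below is an illustration under stated assumptions, not the TBB implementation: tag_join_model, order, payment, by_id and matched_ids are hypothetical names, the join has exactly two ports, each tag arrives at most once per port, and every emitted tuple is assumed to be accepted by a successor.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical two-port model of the tag_matching policy; not the TBB join_node.
template <typename T0, typename T1, typename Tag0, typename Tag1>
class tag_join_model {
public:
    typedef std::tuple<T0, T1> output_type;

    tag_join_model(Tag0 t0, Tag1 t1) : tag0_(t0), tag1_(t1) {}

    // Each put stores the message in that port's hash table, keyed by its tag.
    void put0(const T0& v) {
        const std::size_t tag = tag0_(v);
        port0_.insert(std::make_pair(tag, v));
        match(tag);
    }
    void put1(const T1& v) {
        const std::size_t tag = tag1_(v);
        port1_.insert(std::make_pair(tag, v));
        match(tag);
    }

    // Tuples "broadcast" so far, in emission order.
    const std::vector<output_type>& emitted() const { return out_; }

private:
    // Emits a tuple once both ports hold a message for this tag, then removes both.
    void match(std::size_t tag) {
        typename std::unordered_map<std::size_t, T0>::iterator a = port0_.find(tag);
        typename std::unordered_map<std::size_t, T1>::iterator b = port1_.find(tag);
        if (a == port0_.end() || b == port1_.end()) return;
        out_.push_back(output_type(a->second, b->second));
        port0_.erase(a);
        port1_.erase(b);
    }

    Tag0 tag0_;
    Tag1 tag1_;
    std::unordered_map<std::size_t, T0> port0_;
    std::unordered_map<std::size_t, T1> port1_;
    std::vector<output_type> out_;
};

struct order   { std::size_t id; int quantity; };
struct payment { std::size_t id; double amount; };

// Tag functor used for both ports: the tag is the message's id field.
struct by_id {
    template <typename M>
    std::size_t operator()(const M& m) const { return m.id; }
};

// Feeds messages in mismatched order and reports the ids of the joined pairs.
std::vector<std::size_t> matched_ids() {
    tag_join_model<order, payment, by_id, by_id> join{by_id(), by_id()};
    join.put0(order{7, 2});
    join.put0(order{3, 1});
    join.put1(payment{3, 9.5});  // completes tag 3
    join.put1(payment{9, 1.0});  // no matching order yet, stays in port 1
    join.put1(payment{7, 4.0});  // completes tag 7
    std::vector<std::size_t> ids;
    for (const auto& t : join.emitted()) ids.push_back(std::get<0>(t).id);
    return ids;
}
```

Unlike the queueing policy, the pairs leave in the order their tags are completed, not in arrival order at either port; the payment with id 9 simply waits in its hash table.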
Members
    namespace tbb {
    namespace flow {

    enum graph_buffer_policy {
        rejecting, reserving, queueing, tag_matching };

    template< typename OutputTuple, graph_buffer_policy JP = queueing >
    class join_node :
        public graph_node, public sender< OutputTuple > {
    public:
        typedef OutputTuple output_type;
        typedef receiver<output_type> successor_type;
        implementation-dependent-tuple input_ports_tuple_type;

        join_node(graph &g);
        join_node(const join_node &src);
        input_ports_tuple_type &inputs();
        bool register_successor( successor_type &r );
        bool remove_successor( successor_type &r );
        bool try_get( output_type &v );
        bool try_reserve( output_type & );
        bool try_release( );
        bool try_consume( );
    };

    //
    // Specialization for tag_matching
    //
    template< typename OutputTuple >
    class join_node<OutputTuple, tag_matching> :
        public graph_node, public sender< OutputTuple > {
    public:
        // Has the same methods as the previous join_node,
        // but has constructors to specify the tag_matching
        // function objects
        template< typename B0, typename B1 >
        join_node(graph &g, B0 b0, B1 b1);

        // Constructors are defined similarly for
        // 3 through 10 elements …
    };

    }
    }

6.18.1 join_node( graph &g )

Effect
Creates a join_node that will enqueue tasks using the root task in g.

6.18.2 template < typename B0, typename B1, … > join_node( graph &g, B0 b0, B1 b1, … )

Description
A constructor only available in the tag_matching specialization of join_node.

Effect
Creates a join_node that uses the function objects b0, b1, …, bN to determine the tags for the input ports 0 through N. It will enqueue tasks using the root task in g.

6.18.3 join_node( const join_node &src )

Effect
Creates a join_node that has the same initial state that src had at its construction. The list of predecessors, messages in the input ports, and successors are NOT copied.

6.18.4 input_ports_tuple_type& inputs()

Returns
A std::tuple of receivers. Each element inherits from tbb::receiver<T> where T is the type of message expected at that input. Each tuple element can be used like any other flow::receiver<T>.
The behavior of the ports based on the selected join_node policy is shown in Table 28.

6.18.5 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors. Returns true.

6.18.6 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors. Returns true.

6.18.7 bool try_get( output_type &v )

Description
Attempts to generate a tuple based on the buffering policy of the join_node.

Returns
If it can successfully generate a tuple, it copies it to v and returns true. Otherwise it returns false.

6.18.8 bool try_reserve( output_type & )

Description
A join_node cannot be reserved. Returns false.

6.18.9 bool try_release( )

Description
A join_node cannot be reserved. Returns false.

6.18.10 bool try_consume( )

Description
A join_node cannot be reserved. Returns false.

6.18.11 template<size_t N, typename JNT> typename std::tuple_element<N, typename JNT::input_ports_tuple_type>::type &input_port(JNT &jn)

Description
Equivalent to calling std::get<N>( jn.inputs() ).

Returns
Returns the Nth input port for join_node jn.

6.19 input_port Template Function

Summary
A template function that, given a join_node or or_node, returns a reference to a specific input port.

Syntax
    template<size_t N, typename NT>
    typename std::tuple_element<N,
        typename NT::input_ports_tuple_type>::type&
    input_port(NT &n);

Header
    #include "tbb/flow_graph.h"

6.20 make_edge Template Function

Summary
A template function that adds an edge between a sender<T> and a receiver<T>.

Syntax
    template< typename T >
    inline void make_edge( sender<T> &p, receiver<T> &s );

Header
    #include "tbb/flow_graph.h"

6.21 remove_edge Template Function

Summary
A template function that removes an edge between a sender<T> and a receiver<T>.

Syntax
    template< typename T >
    void remove_edge( sender<T> &p, receiver<T> &s );

Header
    #include "tbb/flow_graph.h"

6.22 copy_body Template Function

Summary
A template function that returns a copy of the body function object from a continue_node or function_node.
Syntax
    template< typename Body, typename Node >
    Body copy_body( Node &n );

Header
    #include "tbb/flow_graph.h"

7 Thread Local Storage

Intel® Threading Building Blocks (Intel® TBB) provides two template classes for thread local storage. Both provide a thread-local element per thread. Both lazily create the elements on demand. They differ in their intended use models:

combinable provides thread-local storage for holding per-thread subcomputations that will later be reduced to a single result. It is PPL compatible.

enumerable_thread_specific provides thread-local storage that acts like an STL container with one element per thread. The container permits iterating over the elements using the usual STL iteration idioms.

This chapter also describes template class flattened2d, which assists a common idiom where an enumerable_thread_specific represents a container partitioned across threads.

7.1 combinable Template Class

Summary
Template class for holding thread-local values during a parallel computation that will be merged into a final result.

Syntax
    template<typename T> class combinable;

Header
    #include "tbb/combinable.h"

Description
A combinable<T> provides each thread with its own local instance of type T.

Members
    namespace tbb {
        template <typename T>
        class combinable {
        public:
            combinable();

            template <typename FInit>
            combinable(FInit finit);

            combinable(const combinable& other);

            ~combinable();

            combinable& operator=( const combinable& other);
            void clear();

            T& local();
            T& local(bool & exists);

            template<typename FCombine> T combine(FCombine fcombine);
            template<typename Func> void combine_each(Func f);
        };
    }

7.1.1 combinable()

Effects
Constructs combinable such that any thread-local instances of T will be created using default construction.

7.1.2 template<typename FInit> combinable(FInit finit)

Effects
Constructs combinable such that any thread-local element will be created by copying the result of finit().

NOTE: The expression finit() must be safe to evaluate concurrently by multiple threads.
It is evaluated each time a thread-local element is created.

7.1.3 combinable( const combinable& other );

Effects
Constructs a copy of other, so that it has copies of each element in other with the same thread mapping.

7.1.4 ~combinable()

Effects
Destroys all thread-local elements in *this.

7.1.5 combinable& operator=( const combinable& other )

Effects
Sets *this to be a copy of other.

7.1.6 void clear()

Effects
Removes all elements from *this.

7.1.7 T& local()

Effects
If the thread-local element does not exist, creates it.

Returns
Reference to thread-local element.

7.1.8 T& local( bool& exists )

Effects
Similar to local(), except that exists is set to true if an element was already present for the current thread; false otherwise.

Returns
Reference to thread-local element.

7.1.9 template<typename FCombine> T combine(FCombine fcombine)

Requires
Parameter fcombine should be an associative binary functor with the signature T(T,T) or T(const T&,const T&).

Effects
Computes reduction over all elements using binary functor fcombine. If there are no elements, creates the result using the same rules as for creating a thread-local element.

Returns
Result of the reduction.

7.1.10 template<typename Func> void combine_each(Func f)

Requires
Parameter f should be a unary functor with the signature void(T) or void(const T&).

Effects
Evaluates f(x) for each instance x of T in *this.

7.2 enumerable_thread_specific Template Class

Summary
Template class for thread local storage.

Syntax
    enum ets_key_usage_type {
        ets_key_per_instance,
        ets_no_key
    };

    template <typename T,
              typename Allocator=cache_aligned_allocator<T>,
              ets_key_usage_type ETS_key_type=ets_no_key>
    class enumerable_thread_specific;

Header
    #include "tbb/enumerable_thread_specific.h"

Description
An enumerable_thread_specific provides thread local storage (TLS) for elements of type T. An enumerable_thread_specific acts as a container by providing iterators and ranges across all of the thread-local elements.

The thread-local elements are created lazily.
A freshly constructed enumerable_thread_specific has no elements. When a thread requests access to an enumerable_thread_specific, it creates an element corresponding to that thread. The number of elements is equal to the number of distinct threads that have accessed the enumerable_thread_specific and not the number of threads in use by the application. Clearing an enumerable_thread_specific removes all of its elements.

The ETS_key_usage_type parameter can be used to select between an implementation that consumes no native TLS keys and a specialization that offers higher performance but consumes 1 native TLS key per enumerable_thread_specific instance. If no ETS_key_usage_type parameter is provided, ets_no_key is used by default.

CAUTION: The number of native TLS keys is limited and can be fairly small, for example 64 or 128. Therefore it is recommended to restrict the use of the ets_key_per_instance specialization to only the most performance critical cases.

Example
The following code shows a simple example usage of enumerable_thread_specific. The number of calls to Body::operator() and the total number of iterations executed are counted by each thread that participates in the parallel_for, and these counts are printed at the end of main.
    #include <cstdio>
    #include <utility>

    #include "tbb/task_scheduler_init.h"
    #include "tbb/enumerable_thread_specific.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    // One (calls, iterations) pair per participating thread
    typedef enumerable_thread_specific< std::pair<int,int> >
        CounterType;
    CounterType MyCounters (std::make_pair(0,0));

    struct Body {
        void operator()(const tbb::blocked_range<int> &r) const {
            CounterType::reference my_counter = MyCounters.local();
            ++my_counter.first;    // one more call to operator()
            for (int i = r.begin(); i != r.end(); ++i)
                ++my_counter.second;   // one more iteration executed
        }
    };

    int main() {
        parallel_for( blocked_range<int>(0, 100000000), Body());

        for (CounterType::const_iterator i = MyCounters.begin();
             i != MyCounters.end(); ++i)
        {
            printf("Thread stats:\n");
            printf(" calls to operator(): %d", i->first);
            printf(" total # of iterations executed: %d\n\n",
                   i->second);
        }
    }

Example with Lambda Expressions
Class enumerable_thread_specific has a method combine(f) that does reduction using binary functor f, which can be written using a lambda expression.
For example, the previous example can be extended to sum the thread-local values by adding the following lines to the end of function main:

    std::pair<int,int> sum =
        MyCounters.combine([](std::pair<int,int> x,
                              std::pair<int,int> y) {
            return std::make_pair(x.first+y.first,
                                  x.second+y.second);
        });
    printf("Total calls to operator() = %d, "
           "total iterations = %d\n", sum.first, sum.second);

Members
    namespace tbb {
        template <typename T,
                  typename Allocator=cache_aligned_allocator<T>,
                  ets_key_usage_type ETS_key_type=ets_no_key >
        class enumerable_thread_specific {
        public:
            // Basic types
            typedef Allocator allocator_type;
            typedef T value_type;
            typedef T& reference;
            typedef const T& const_reference;
            typedef T* pointer;
            typedef implementation-dependent size_type;
            typedef implementation-dependent difference_type;

            // Iterator types
            typedef implementation-dependent iterator;
            typedef implementation-dependent const_iterator;

            // Parallel range types
            typedef implementation-dependent range_type;
            typedef implementation-dependent const_range_type;

            // Whole container operations
            enumerable_thread_specific();
            enumerable_thread_specific(
                const enumerable_thread_specific &other
            );
            template<typename U, typename Alloc,
                     ets_key_usage_type Cachetype>
            enumerable_thread_specific(
                const enumerable_thread_specific<U, Alloc,
                                                 Cachetype>& other );
            template <typename Finit>
            enumerable_thread_specific( Finit finit );
            enumerable_thread_specific(const T &exemplar);
            ~enumerable_thread_specific();
            enumerable_thread_specific&
            operator=(const enumerable_thread_specific& other);
            template<typename U, typename Alloc,
                     ets_key_usage_type Cachetype>
            enumerable_thread_specific&
            operator=(
                const enumerable_thread_specific<U, Alloc, Cachetype>&
                    other
            );
            void clear();

            // Concurrent operations
            reference local();
            reference local(bool& exists);
            size_type size() const;
            bool empty() const;

            // Combining
            template<typename FCombine> T combine(FCombine fcombine);
            template<typename Func> void combine_each(Func f);

            // Parallel iteration
            range_type range( size_t grainsize=1 );
            const_range_type range( size_t grainsize=1 ) const;

            // Iterators
            iterator begin();
            iterator end();
            const_iterator begin() const;
            const_iterator end() const;
        };
    }

7.2.1 Whole Container Operations

Safety
These operations must not be invoked concurrently on the same instance of enumerable_thread_specific.

7.2.1.1 enumerable_thread_specific()

Effects
Constructs an enumerable_thread_specific where each local copy will be default constructed.

7.2.1.2 enumerable_thread_specific(const enumerable_thread_specific &other)

Effects
Copy constructs an enumerable_thread_specific. The values are copy constructed from the values in other and have the same thread correspondence.

7.2.1.3 template<typename U, typename Alloc, ets_key_usage_type Cachetype> enumerable_thread_specific( const enumerable_thread_specific<U, Alloc, Cachetype>& other )

Effects
Copy constructs an enumerable_thread_specific. The values are copy constructed from the values in other and have the same thread correspondence.

7.2.1.4 template< typename Finit> enumerable_thread_specific(Finit finit)

Effects
Constructs enumerable_thread_specific such that any thread-local element will be created by copying the result of finit().

NOTE: The expression finit() must be safe to evaluate concurrently by multiple threads. It is evaluated each time a thread-local element is created.

7.2.1.5 enumerable_thread_specific(const T &exemplar)

Effects
Constructs an enumerable_thread_specific where each local copy will be copy constructed from exemplar.

7.2.1.6 ~enumerable_thread_specific()

Effects
Destroys all elements in *this. Destroys any native TLS keys that were created for this instance.

7.2.1.7 enumerable_thread_specific& operator=(const enumerable_thread_specific& other);

Effects
Sets *this to be a copy of other.

7.2.1.8 template< typename U, typename Alloc, ets_key_usage_type Cachetype> enumerable_thread_specific& operator=(const enumerable_thread_specific<U, Alloc, Cachetype>& other);

Effects
Sets *this to be a copy of other.

NOTE: The allocator and key usage specialization is unchanged by this call.

7.2.1.9 void clear()

Effects
Destroys all elements in *this. Destroys and then recreates any native TLS keys used in the implementation.
NOTE: In the current implementation, there is no performance advantage of using clear instead of destroying and reconstructing an enumerable_thread_specific.

7.2.2 Concurrent Operations

7.2.2.1 reference local()

Returns
A reference to the element of *this that corresponds to the current thread.

Effects
If there is no current element corresponding to the current thread, then constructs a new element. A new element is copy-constructed if an exemplar was provided to the constructor for *this, otherwise a new element is default constructed.

7.2.2.2 reference local( bool& exists )

Effects
Similar to local(), except that exists is set to true if an element was already present for the current thread; false otherwise.

Returns
Reference to thread-local element.

7.2.2.3 size_type size() const

Returns
The number of elements in *this. The value is equal to the number of distinct threads that have called local() after *this was constructed or most recently cleared.

7.2.2.4 bool empty() const

Returns
size()==0

7.2.3 Combining

The methods in this section iterate across the entire container.

7.2.3.1 template<typename FCombine> T combine(FCombine fcombine)

Requires
Parameter fcombine should be an associative binary functor with the signature T(T,T) or T(const T&,const T&).

Effects
Computes reduction over all elements using binary functor fcombine. If there are no elements, creates the result using the same rules as for creating a thread-local element.

Returns
Result of the reduction.

7.2.3.2 template<typename Func> void combine_each(Func f)

Requires
Parameter f should be a unary functor with the signature void(T) or void(const T&).

Effects
Evaluates f(x) for each instance x of T in *this.

7.2.4 Parallel Iteration

Types const_range_type and range_type model the Container Range concept (5.1). The types differ only in that the bounds for a const_range_type are of type const_iterator, whereas the bounds for a range_type are of type iterator.
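The lazy creation performed by local() (7.2.2.1) and the reduction performed by combine (7.2.3.1) can be sketched with the standard library alone. This is only an illustration of the semantics: per_thread_model and sum_with_threads are hypothetical names, the mutex-protected std::map keyed by std::thread::id is an assumption chosen for brevity, and it says nothing about how TBB actually stores elements or uses native TLS keys.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model of lazily created per-thread elements; not the TBB implementation.
template <typename T>
class per_thread_model {
public:
    explicit per_thread_model(const T& exemplar) : exemplar_(exemplar) {}

    // Returns the calling thread's element, copy-constructing it from the
    // exemplar on first access. References into std::map stay valid on insert.
    T& local() {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_.insert(
            std::make_pair(std::this_thread::get_id(), exemplar_)).first->second;
    }

    // Reduces all existing elements with fcombine; with no elements, the result
    // is created the same way a thread-local element would be.
    template <typename FCombine>
    T combine(FCombine fcombine) const {
        std::lock_guard<std::mutex> lock(mutex_);
        typename std::map<std::thread::id, T>::const_iterator it = slots_.begin();
        if (it == slots_.end()) return exemplar_;
        T result = it->second;
        for (++it; it != slots_.end(); ++it) result = fcombine(result, it->second);
        return result;
    }

private:
    T exemplar_;
    mutable std::mutex mutex_;
    std::map<std::thread::id, T> slots_;
};

// Each worker increments only its own element; the partial counts are summed at the end.
long sum_with_threads(int nthreads, int increments) {
    per_thread_model<long> counters(0L);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.push_back(std::thread([&counters, increments] {
            long& mine = counters.local();
            for (int i = 0; i < increments; ++i) ++mine;
        }));
    for (std::size_t t = 0; t < pool.size(); ++t) pool[t].join();
    return counters.combine([](long x, long y) { return x + y; });
}
```

Because each worker touches only its own element inside the loop, the increments need no synchronization; the lock is taken only when an element is looked up or when the elements are combined. On some toolchains the program must be linked with thread support (for example -pthread).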
7.2.4.1 const_range_type range( size_t grainsize=1 ) const

Returns
A const_range_type representing all elements in *this. The parameter grainsize is in units of elements.

7.2.4.2 range_type range( size_t grainsize=1 )

Returns
A range_type representing all elements in *this. The parameter grainsize is in units of elements.

7.2.5 Iterators

Template class enumerable_thread_specific supports random access iterators, which enable iteration over the set of all elements in the container.

7.2.5.1 iterator begin()

Returns
iterator pointing to beginning of the set of elements.

7.2.5.2 iterator end()

Returns
iterator pointing to end of the set of elements.

7.2.5.3 const_iterator begin() const

Returns
const_iterator pointing to beginning of the set of elements.

7.2.5.4 const_iterator end() const

Returns
const_iterator pointing to the end of the set of elements.

7.3 flattened2d Template Class

Summary
Adaptor that provides a flattened view of a container of containers.

Syntax
    template<typename Container>
    class flattened2d;

    template <typename Container>
    flattened2d<Container> flatten2d(const Container &c);

    template <typename Container>
    flattened2d<Container> flatten2d(
        const Container &c,
        const typename Container::const_iterator b,
        const typename Container::const_iterator e);

Header
    #include "tbb/enumerable_thread_specific.h"

Description
A flattened2d provides a flattened view of a container of containers. Iterating from begin() to end() visits all of the elements in the inner containers. This can be useful when traversing an enumerable_thread_specific whose elements are containers.

The utility function flatten2d creates a flattened2d object from a container.

Example
The following code shows a simple example usage of flatten2d and flattened2d. Each thread collects the values of i that are evenly divisible by K in a thread-local vector. In main, the results are printed by using a flattened2d to simplify the traversal of all of the elements in all of the local vectors.
    #include <iostream>
    #include <utility>
    #include <vector>

    #include "tbb/task_scheduler_init.h"
    #include "tbb/enumerable_thread_specific.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    using namespace tbb;

    // A VecType has a separate std::vector<int> per thread
    typedef enumerable_thread_specific< std::vector<int> > VecType;
    VecType MyVectors;
    int K = 1000000;

    struct Func {
        void operator()(const blocked_range<int>& r) const {
            VecType::reference v = MyVectors.local();
            // Collect the multiples of K seen by this thread
            for (int i=r.begin(); i!=r.end(); ++i)
                if( i%K==0 )
                    v.push_back(i);
        }
    };

    int main() {
        parallel_for(blocked_range<int>(0, 100000000),
                     Func());

        // Traverse every element of every thread-local vector
        flattened2d<VecType> flat_view = flatten2d( MyVectors );
        for( flattened2d<VecType>::const_iterator
                 i = flat_view.begin(); i != flat_view.end(); ++i)
            std::cout << *i << std::endl;
        return 0;
    }

Members
    namespace tbb {

        template<typename Container>
        class flattened2d {

        public:
            // Basic types
            typedef implementation-dependent size_type;
            typedef implementation-dependent difference_type;
            typedef implementation-dependent allocator_type;
            typedef implementation-dependent value_type;
            typedef implementation-dependent reference;
            typedef implementation-dependent const_reference;
            typedef implementation-dependent pointer;
            typedef implementation-dependent const_pointer;

            typedef implementation-dependent iterator;
            typedef implementation-dependent const_iterator;

            flattened2d( const Container& c );

            flattened2d( const Container& c,
                         typename Container::const_iterator first,
                         typename Container::const_iterator last );

            iterator begin();
            iterator end();
            const_iterator begin() const;
            const_iterator end() const;

            size_type size() const;
        };

        template <typename Container>
        flattened2d<Container> flatten2d(const Container &c);

        template <typename Container>
        flattened2d<Container> flatten2d(
            const Container &c,
            const typename Container::const_iterator first,
            const typename Container::const_iterator last);
    }

7.3.1 Whole Container Operations

Safety
These operations must not be invoked concurrently on the same flattened2d.
7.3.1.1 flattened2d( const Container& c )

Effects
Constructs a flattened2d representing the sequence of elements in the inner containers contained by outer container c.

7.3.1.2 flattened2d( const Container& c, typename Container::const_iterator first, typename Container::const_iterator last )

Effects
Constructs a flattened2d representing the sequence of elements in the inner containers in the half-open interval [first, last) of Container c.

7.3.2 Concurrent Operations

Safety
These operations may be invoked concurrently on the same flattened2d.

7.3.2.1 size_type size() const

Returns
The sum of the sizes of the inner containers that are viewable in the flattened2d.

7.3.3 Iterators

Template class flattened2d supports forward iterators only.

7.3.3.1 iterator begin()

Returns
iterator pointing to beginning of the set of local copies.

7.3.3.2 iterator end()

Returns
iterator pointing to end of the set of local copies.

7.3.3.3 const_iterator begin() const

Returns
const_iterator pointing to beginning of the set of local copies.

7.3.3.4 const_iterator end() const

Returns
const_iterator pointing to the end of the set of local copies.

7.3.4 Utility Functions

template <typename Container> flattened2d<Container> flatten2d(const Container &c, const typename Container::const_iterator b, const typename Container::const_iterator e)

Returns
Constructs and returns a flattened2d that provides iterators that traverse the elements in the containers within the half-open range [b, e) of Container c.

template <typename Container> flattened2d<Container> flatten2d( const Container &c )

Returns
Constructs and returns a flattened2d that provides iterators that traverse the elements in all of the containers within Container c.

8 Memory Allocation

This section describes classes related to memory allocation.
8.1 Allocator Concept

The allocator concept for allocators in Intel® Threading Building Blocks is similar to the "Allocator requirements" in Table 32 of the ISO C++ Standard, but with further guarantees required by the ISO C++ Standard (Section 20.1.5 paragraph 4) for use with ISO C++ containers. Table 29 summarizes the allocator concept. Here, A and B represent instances of the allocator class.

Table 29: Allocator Concept

    Pseudo-Signature                                      Semantics
    typedef T* A::pointer                                 Pointer to T.
    typedef const T* A::const_pointer                     Pointer to const T.
    typedef T& A::reference                               Reference to T.
    typedef const T& A::const_reference                   Reference to const T.
    typedef T A::value_type                               Type of value to be allocated.
    typedef size_t A::size_type                           Type for representing number of values.
    typedef ptrdiff_t A::difference_type                  Type for representing pointer difference.
    template<typename U> struct rebind {
        typedef A<U> A::other;
    };                                                    Rebind to a different type U.
    A() throw()                                           Default constructor.
    A( const A& ) throw()                                 Copy constructor.
    template<typename U> A( const A<U>& )                 Rebinding constructor.
    ~A() throw()                                          Destructor.
    T* A::address( T& x ) const                           Take address.
    const T* A::const_address( const T& x ) const         Take const address.
    T* A::allocate( size_type n, const void* hint=0 )     Allocate space for n values.
    void A::deallocate( T* p, size_t n )                  Deallocate n values.
    size_type A::max_size() const throw()                 Maximum plausible argument to method allocate.
    void A::construct( T* p, const T& value )             new(p) T(value)
    void A::destroy( T* p )                               p->T::~T()
    bool operator==( const A&, const B& )                 Return true.
    bool operator!=( const A&, const B& )                 Return false.

Model Types
Template classes tbb_allocator (8.2), scalable_allocator (8.3), cache_aligned_allocator (8.4), and zero_allocator (8.5) model the Allocator concept.

8.2 tbb_allocator Template Class

Summary
Template class for scalable memory allocation if available; possibly non-scalable otherwise.
Syntax
    template<typename T> class tbb_allocator

Header
    #include "tbb/tbb_allocator.h"

Description
A tbb_allocator allocates and frees memory via the Intel® TBB malloc library if it is available; otherwise it reverts to using malloc and free.

TIP: Set the environment variable TBB_VERSION to 1 to find out if the Intel® TBB malloc library is being used. Details are in Section 3.1.2.

8.3 scalable_allocator Template Class

Summary
Template class for scalable memory allocation.

Syntax
    template<typename T> class scalable_allocator;

Header
    #include "tbb/scalable_allocator.h"

Description
A scalable_allocator allocates and frees memory in a way that scales with the number of processors. A scalable_allocator models the allocator requirements described in Table 29. Using a scalable_allocator in place of std::allocator may improve program performance. Memory allocated by a scalable_allocator should be freed by a scalable_allocator, not by a std::allocator.

CAUTION: The scalable_allocator requires that the tbbmalloc library be available. If the library is missing, calls to the scalable allocator fail. In contrast, tbb_allocator falls back on malloc and free if the tbbmalloc library is missing.

Members
See Allocator concept (8.1).

Acknowledgement
The scalable memory allocator incorporates McRT technology developed by Intel's PSL CTG team.

8.3.1 C Interface to Scalable Allocator

Summary
Low level interface for scalable memory allocation.

Syntax
    extern "C" {
        // Scalable analogs of C memory allocator
        void* scalable_malloc( size_t size );
        void scalable_free( void* ptr );
        void* scalable_calloc( size_t nobj, size_t size );
        void* scalable_realloc( void* ptr, size_t size );

        // Analog of _msize/malloc_size/malloc_usable_size.
        size_t scalable_msize( void* ptr );

        // Scalable analog of posix_memalign
        int scalable_posix_memalign( void** memptr, size_t alignment, size_t size );

        // Aligned allocation
        void* scalable_aligned_malloc( size_t size, size_t alignment );
        void scalable_aligned_free( void* ptr );
        void* scalable_aligned_realloc( void* ptr, size_t size, size_t alignment );
    }

Header
    #include "tbb/scalable_allocator.h"

Description
These functions provide a C level interface to the scalable allocator. Each routine scalable_x behaves analogously to library function x. The routines form the two families shown in Table 30. Storage allocated by a scalable_x function in one family must be freed or resized by a scalable_x function in the same family, not by a C standard library function. Likewise, storage allocated by a C standard library function should not be freed or resized by a scalable_x function.

Table 30: C Interface to Scalable Allocator

Family | Allocation Routine | Deallocation Routine | Analogous Library
1 | scalable_malloc, scalable_calloc, scalable_realloc | scalable_free | C standard library
1 | scalable_posix_memalign | scalable_free | POSIX* (see note)
2 | scalable_aligned_malloc, scalable_aligned_realloc | scalable_aligned_free | Microsoft* C run-time library

Note: See "The Open Group* Base Specifications Issue 6", IEEE* Std 1003.1, 2004 Edition for the definition of posix_memalign.

8.3.1.1 size_t scalable_msize( void* ptr )

Returns
The usable size of the memory block pointed to by ptr if it was allocated by the scalable allocator. Returns zero if ptr does not point to such a block.

8.4 cache_aligned_allocator Template Class

Summary
Template class for allocating memory in a way that avoids false sharing.

Syntax
    template<typename T> class cache_aligned_allocator;

Header
    #include "tbb/cache_aligned_allocator.h"

Description
A cache_aligned_allocator allocates memory on cache line boundaries, in order to avoid false sharing.
False sharing occurs when logically distinct items occupy the same cache line, which can hurt performance if multiple threads attempt to access the different items simultaneously. Even though the items are logically separate, the processor hardware may have to transfer the cache line between the processors as if they were sharing a location. The net result can be much more memory traffic than if the logically distinct items were on different cache lines.

A cache_aligned_allocator models the allocator requirements described in Table 29. It can be used to replace a std::allocator. Used judiciously, cache_aligned_allocator can improve performance by reducing false sharing. However, it is sometimes an inappropriate replacement, because the benefit of allocating on a cache line comes at the price that cache_aligned_allocator implicitly adds pad memory. The padding is typically 128 bytes. Hence allocating many small objects with cache_aligned_allocator may increase memory usage.

Members
    namespace tbb {
        template<typename T>
        class cache_aligned_allocator {
        public:
            typedef T* pointer;
            typedef const T* const_pointer;
            typedef T& reference;
            typedef const T& const_reference;
            typedef T value_type;
            typedef size_t size_type;
            typedef ptrdiff_t difference_type;
            template<typename U> struct rebind {
                typedef cache_aligned_allocator<U> other;
            };

        #if _WIN64
            char* _Charalloc( size_type size );
        #endif /* _WIN64 */

            cache_aligned_allocator() throw();
            cache_aligned_allocator( const cache_aligned_allocator& ) throw();
            template<typename U>
            cache_aligned_allocator( const cache_aligned_allocator<U>& ) throw();
            ~cache_aligned_allocator();

            pointer address(reference x) const;
            const_pointer address(const_reference x) const;

            pointer allocate( size_type n, const void* hint=0 );
            void deallocate( pointer p, size_type );
            size_type max_size() const throw();

            void construct( pointer p, const T& value );
            void destroy( pointer p );
        };

        template<>
        class cache_aligned_allocator<void> {
        public:
            typedef void* pointer;
            typedef const void* const_pointer;
            typedef void value_type;
            template<typename U> struct rebind {
                typedef cache_aligned_allocator<U> other;
            };
        };

        template<typename T, typename U>
        bool operator==( const cache_aligned_allocator<T>&, const cache_aligned_allocator<U>& );

        template<typename T, typename U>
        bool operator!=( const cache_aligned_allocator<T>&, const cache_aligned_allocator<U>& );
    }

For the sake of brevity, the following subsections describe only those methods that differ significantly from the corresponding methods of std::allocator.

8.4.1 pointer allocate( size_type n, const void* hint=0 )

Effects
Allocates space for n values on a cache-line boundary. The allocation may include extra hidden padding.

Returns
Pointer to the allocated memory.

8.4.2 void deallocate( pointer p, size_type n )

Requirements
Pointer p must be the result of method allocate(n). The memory must not have been already deallocated.

Effects
Deallocates memory pointed to by p. The deallocation also deallocates any extra hidden padding.

8.4.3 char* _Charalloc( size_type size )

NOTE: This method is provided only on 64-bit Windows* OS platforms. It is a non-ISO method that exists for backwards compatibility with versions of Windows containers that seem to require it. Please do not use it directly.

8.5 zero_allocator

Summary
Template class for allocator that returns zeroed memory.

Syntax
    template<typename T, template<typename U> class Alloc = tbb_allocator>
    class zero_allocator : public Alloc<T>;

Header
    #include "tbb/tbb_allocator.h"

Description
A zero_allocator allocates zeroed memory. A zero_allocator<T,A> can be instantiated for any class A that models the Allocator concept. The default for A is tbb_allocator. A zero_allocator forwards allocation requests to A and zeros the allocation before returning it.
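The forward-then-zero behavior described above can be sketched with the standard library alone. The following is a minimal illustration, not the TBB implementation: the names zeroing_allocator and fresh_ints_are_zero are hypothetical, std::allocator stands in for the base allocator, and the comparison operators follow the "Return true" / "Return false" rows of Table 29.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical sketch: forwards each request to std::allocator<T> and
// zero-fills the storage before handing it out. Not the TBB source.
template <typename T>
struct zeroing_allocator {
    typedef T value_type;

    zeroing_allocator() noexcept {}

    // Rebinding constructor, as required by the Allocator concept.
    template <typename U>
    zeroing_allocator(const zeroing_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        T* p = std::allocator<T>().allocate(n);                // forward the request
        std::memset(static_cast<void*>(p), 0, n * sizeof(T));  // zero the raw bytes
        return p;
    }

    void deallocate(T* p, std::size_t n) {
        std::allocator<T>().deallocate(p, n);
    }
};

// Stateless allocators always compare equal.
template <typename T, typename U>
bool operator==(const zeroing_allocator<T>&, const zeroing_allocator<U>&) { return true; }

template <typename T, typename U>
bool operator!=(const zeroing_allocator<T>&, const zeroing_allocator<U>&) { return false; }

// Returns true if every one of n freshly allocated ints reads as zero.
bool fresh_ints_are_zero(std::size_t n) {
    zeroing_allocator<int> a;
    int* p = a.allocate(n);
    bool all_zero = true;
    for (std::size_t i = 0; i < n; ++i) all_zero = all_zero && (p[i] == 0);
    a.deallocate(p, n);
    return all_zero;
}
```

Because the sketch models the allocator requirements, it can also be passed as the allocator argument of a standard container, for example std::vector<int, zeroing_allocator<int> >.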
Members
    namespace tbb {
        template<typename T, template<typename U> class Alloc = tbb_allocator>
        class zero_allocator : public Alloc<T> {
        public:
            typedef Alloc<T> base_allocator_type;
            typedef typename base_allocator_type::value_type value_type;
            typedef typename base_allocator_type::pointer pointer;
            typedef typename base_allocator_type::const_pointer const_pointer;
            typedef typename base_allocator_type::reference reference;
            typedef typename base_allocator_type::const_reference const_reference;
            typedef typename base_allocator_type::size_type size_type;
            typedef typename base_allocator_type::difference_type difference_type;
            template<typename U> struct rebind {
                typedef zero_allocator<U, Alloc> other;
            };

            zero_allocator() throw() { }
            zero_allocator(const zero_allocator &a) throw();
            template<typename U>
            zero_allocator(const zero_allocator<U> &a) throw();

            pointer allocate(const size_type n, const void* hint=0);
        };
    }

8.6 aligned_space Template Class

Summary
Uninitialized memory space for an array of a given type.

Syntax
    template<typename T, size_t N> class aligned_space;

Header
    #include "tbb/aligned_space.h"

Description
An aligned_space occupies enough memory and is sufficiently aligned to hold an array T[N]. The client is responsible for initializing or destroying the objects. An aligned_space is typically used as a local variable or field in scenarios where a block of fixed-length uninitialized memory is needed.

Members
    namespace tbb {
        template<typename T, size_t N>
        class aligned_space {
        public:
            aligned_space();
            ~aligned_space();
            T* begin();
            T* end();
        };
    }

8.6.1 aligned_space()

Effects
None. Does not invoke constructors.

8.6.2 ~aligned_space()

Effects
None. Does not invoke destructors.

8.6.3 T* begin()

Returns
Pointer to beginning of storage.

8.6.4 T* end()

Returns
begin()+N

9 Synchronization

The library supports mutual exclusion and atomic operations.

9.1 Mutexes

Mutexes provide MUTual EXclusion of threads from sections of code.
In general, strive for designs that minimize the use of explicit locking, because it can lead to serial bottlenecks. If explicit locking is necessary, try to spread it out so that multiple threads usually do not contend to lock the same mutex.

9.1.1 Mutex Concept

The mutexes and locks here have relatively spartan interfaces that are designed for high performance. The interfaces enforce the scoped locking pattern, which is widely used in C++ libraries because it:

1. Does not require the programmer to remember to release the lock
2. Releases the lock if an exception is thrown out of the mutual exclusion region protected by the lock

There are two parts to the pattern: a mutex object and a lock object. Construction of the lock object acquires a lock on the mutex, and destruction of the lock object releases the lock. Here's an example:

    {
        // Construction of myLock acquires lock on myMutex
        M::scoped_lock myLock( myMutex );
        ... actions to be performed while holding the lock ...
        // Destruction of myLock releases lock on myMutex
    }

If the actions throw an exception, the lock is automatically released as the block is exited.

Table 31 shows the requirements for the Mutex concept for a mutex type M.

Table 31: Mutex Concept

Pseudo-Signature | Semantics
M() | Construct unlocked mutex.
~M() | Destroy unlocked mutex.
typename M::scoped_lock | Corresponding scoped-lock type.
M::scoped_lock() | Construct lock without acquiring mutex.
M::scoped_lock(M&) | Construct lock and acquire lock on mutex.
M::~scoped_lock() | Release lock (if acquired).
M::scoped_lock::acquire(M&) | Acquire lock on mutex.
bool M::scoped_lock::try_acquire(M&) | Try to acquire lock on mutex. Return true if lock acquired, false otherwise.
M::scoped_lock::release() | Release lock.
static const bool M::is_rw_mutex | True if mutex is reader-writer mutex; false otherwise.
static const bool M::is_recursive_mutex | True if mutex is recursive mutex; false otherwise.
static const bool M::is_fair_mutex | True if mutex is fair; false otherwise.

Table 32 summarizes the classes that model the Mutex concept.

Table 32: Mutexes that Model the Mutex Concept

Mutex | Scalable | Fair | Reentrant | Long Wait | Size
mutex | OS dependent | OS dependent | No | Blocks | ≥ 3 words
recursive_mutex | OS dependent | OS dependent | Yes | Blocks | ≥ 3 words
spin_mutex | No | No | No | Yields | 1 byte
queuing_mutex | Yes | Yes | No | Yields | 1 word
spin_rw_mutex | No | No | No | Yields | 1 word
queuing_rw_mutex | Yes | Yes | No | Yields | 1 word
null_mutex | - | Yes | Yes | - | empty
null_rw_mutex | - | Yes | Yes | - | empty

See the Tutorial, Section 6.1.1, for a discussion of the mutex properties and the rationale for null mutexes.

9.1.1.1 C++ 200x Compatibility

Classes mutex, recursive_mutex, spin_mutex, and spin_rw_mutex support the C++ 200x interfaces described in Table 33.

Table 33: C++ 200x Methods Available for Some Mutexes

Pseudo-Signature | Semantics
void M::lock() | Acquire lock.
bool M::try_lock() | Try to acquire lock on mutex. Return true if lock acquired, false otherwise.
void M::unlock() | Release lock.
class lock_guard<M>, class unique_lock<M> | See Section 9.4.

Classes mutex and recursive_mutex also provide the C++ 200x idiom for accessing their underlying OS handles, as described in Table 34.

Table 34: Native handle interface (M is mutex or recursive_mutex)

Pseudo-Signature | Semantics
M::native_handle_type | Native handle type. On the Windows* operating system: LPCRITICAL_SECTION. On other operating systems: (pthread_mutex*).
native_handle_type M::native_handle() | Get underlying native handle of mutex M.

As an extension to C++ 200x, class spin_rw_mutex also has methods read_lock() and try_read_lock() for corresponding operations that acquire reader locks.

9.1.2 mutex Class

Summary
Class that models Mutex Concept using underlying OS locks.

Syntax
    class mutex;

Header
    #include "tbb/mutex.h"

Description
A mutex models the Mutex Concept (9.1.1).
It is a wrapper around OS calls that provide mutual exclusion. The advantages of using mutex instead of the OS calls are:

• Portable across all operating systems supported by Intel® Threading Building Blocks.
• Releases the lock if an exception is thrown from the protected region of code.

Members
See Mutex Concept (9.1.1).

9.1.3 recursive_mutex Class

Summary
Class that models Mutex Concept using underlying OS locks and permits recursive acquisition.

Syntax
    class recursive_mutex;

Header
    #include "tbb/recursive_mutex.h"

Description
A recursive_mutex is similar to a mutex (9.1.2), except that a thread may acquire multiple locks on it. The thread must release all locks on a recursive_mutex before any other thread can acquire a lock on it.

Members
See Mutex Concept (9.1.1).

9.1.4 spin_mutex Class

Summary
Class that models Mutex Concept using a spin lock.

Syntax
    class spin_mutex;

Header
    #include "tbb/spin_mutex.h"

Description
A spin_mutex models the Mutex Concept (9.1.1). A spin_mutex is not scalable, fair, or recursive. It is ideal when the lock is lightly contended and is held for only a few machine instructions. If a thread has to wait to acquire a spin_mutex, it busy waits, which can degrade system performance if the wait is long. However, if the wait is typically short, a spin_mutex can significantly improve performance compared to other mutexes.

Members
See Mutex Concept (9.1.1).

9.1.5 queuing_mutex Class

Summary
Class that models Mutex Concept that is fair and scalable.

Syntax
    class queuing_mutex;

Header
    #include "tbb/queuing_mutex.h"

Description
A queuing_mutex models the Mutex Concept (9.1.1). A queuing_mutex is scalable, in the sense that if a thread has to wait to acquire the mutex, it spins on its own local cache line. A queuing_mutex is fair. Threads acquire a lock on a mutex in the order that they request it. A queuing_mutex is not recursive.
The current implementation does busy-waiting, so using a queuing_mutex may degrade system performance if the wait is long.

Members
See Mutex Concept (9.1.1).

9.1.6 ReaderWriterMutex Concept

The ReaderWriterMutex concept extends the Mutex Concept to include the notion of reader-writer locks. It introduces a boolean parameter write that specifies whether a writer lock (write=true) or reader lock (write=false) is being requested. Multiple reader locks can be held simultaneously on a ReaderWriterMutex if it does not have a writer lock on it. A writer lock on a ReaderWriterMutex excludes all other threads from holding a lock on the mutex at the same time.

Table 35 shows the requirements for a ReaderWriterMutex RW. They form a superset of the Mutex Concept (9.1.1).

Table 35: ReaderWriterMutex Concept

Pseudo-Signature | Semantics
RW() | Construct unlocked mutex.
~RW() | Destroy unlocked mutex.
typename RW::scoped_lock | Corresponding scoped-lock type.
RW::scoped_lock() | Construct lock without acquiring mutex.
RW::scoped_lock(RW&, bool write=true) | Construct lock and acquire lock on mutex.
RW::~scoped_lock() | Release lock (if acquired).
RW::scoped_lock::acquire(RW&, bool write=true) | Acquire lock on mutex.
bool RW::scoped_lock::try_acquire(RW&, bool write=true) | Try to acquire lock on mutex. Return true if lock acquired, false otherwise.
RW::scoped_lock::release() | Release lock.
bool RW::scoped_lock::upgrade_to_writer() | Change reader lock to writer lock.
bool RW::scoped_lock::downgrade_to_reader() | Change writer lock to reader lock.
static const bool RW::is_rw_mutex = true | True.
static const bool RW::is_recursive_mutex | True if mutex is recursive; false otherwise. For all current reader-writer mutexes, false.
static const bool RW::is_fair_mutex | True if mutex is fair; false otherwise.

The following subsections explain the semantics of the ReaderWriterMutex concept in detail.
Model Types
Classes spin_rw_mutex (9.1.7) and queuing_rw_mutex (9.1.8) model the ReaderWriterMutex concept.

9.1.6.1 ReaderWriterMutex()

Effects
Constructs unlocked ReaderWriterMutex.

9.1.6.2 ~ReaderWriterMutex()

Effects
Destroys unlocked ReaderWriterMutex. The effect of destroying a locked ReaderWriterMutex is undefined.

9.1.6.3 ReaderWriterMutex::scoped_lock()

Effects
Constructs a scoped_lock object that does not hold a lock on any mutex.

9.1.6.4 ReaderWriterMutex::scoped_lock( ReaderWriterMutex& rw, bool write=true )

Effects
Constructs a scoped_lock object that acquires a lock on mutex rw. The lock is a writer lock if write is true; a reader lock otherwise.

9.1.6.5 ReaderWriterMutex::~scoped_lock()

Effects
If the object holds a lock on a ReaderWriterMutex, releases the lock.

9.1.6.6 void ReaderWriterMutex::scoped_lock::acquire( ReaderWriterMutex& rw, bool write=true )

Effects
Acquires a lock on mutex rw. The lock is a writer lock if write is true; a reader lock otherwise.

9.1.6.7 bool ReaderWriterMutex::scoped_lock::try_acquire( ReaderWriterMutex& rw, bool write=true )

Effects
Attempts to acquire a lock on mutex rw. The lock is a writer lock if write is true; a reader lock otherwise.

Returns
true if the lock is acquired, false otherwise.

9.1.6.8 void ReaderWriterMutex::scoped_lock::release()

Effects
Releases lock. The effect is undefined if no lock is held.

9.1.6.9 bool ReaderWriterMutex::scoped_lock::upgrade_to_writer()

Effects
Changes reader lock to a writer lock. The effect is undefined if the object does not already hold a reader lock.

Returns
false if lock was released in favor of another upgrade request and then reacquired; true otherwise.

9.1.6.10 bool ReaderWriterMutex::scoped_lock::downgrade_to_reader()

Effects
Changes writer lock to a reader lock. The effect is undefined if the object does not already hold a writer lock.
Returns
false if lock was released and reacquired; true otherwise.

Intel's current implementations for spin_rw_mutex and queuing_rw_mutex always return true. Different implementations might sometimes return false.

9.1.7 spin_rw_mutex Class

Summary
Class that models ReaderWriterMutex Concept that is unfair and not scalable.

Syntax
    class spin_rw_mutex;

Header
    #include "tbb/spin_rw_mutex.h"

Description
A spin_rw_mutex models the ReaderWriterMutex Concept (9.1.6). A spin_rw_mutex is not scalable, fair, or recursive. It is ideal when the lock is lightly contended and is held for only a few machine instructions. If a thread has to wait to acquire a spin_rw_mutex, it busy waits, which can degrade system performance if the wait is long. However, if the wait is typically short, a spin_rw_mutex can significantly improve performance compared to other mutexes.

Members
See ReaderWriterMutex concept (9.1.6).

9.1.8 queuing_rw_mutex Class

Summary
Class that models ReaderWriterMutex Concept that is fair and scalable.

Syntax
    class queuing_rw_mutex;

Header
    #include "tbb/queuing_rw_mutex.h"

Description
A queuing_rw_mutex models the ReaderWriterMutex Concept (9.1.6). A queuing_rw_mutex is scalable, in the sense that if a thread has to wait to acquire the mutex, it spins on its own local cache line. A queuing_rw_mutex is fair. Threads acquire a lock on a queuing_rw_mutex in the order that they request it. A queuing_rw_mutex is not recursive.

Members
See ReaderWriterMutex concept (9.1.6).

9.1.9 null_mutex Class

Summary
Class that models Mutex Concept but does nothing.

Syntax
    class null_mutex;

Header
    #include "tbb/null_mutex.h"

Description
A null_mutex models the Mutex Concept (9.1.1) syntactically, but does nothing. It is useful for instantiating a template that expects a Mutex, but no mutual exclusion is actually needed for that instance.

Members
See Mutex Concept (9.1.1).
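The use case for a do-nothing mutex can be illustrated with a template whose locking policy is a type parameter. The sketch below uses only the standard library; noop_mutex, real_mutex, and counter are hypothetical names introduced for illustration, and each mutex type exposes only the scoped_lock(M&) constructor and destructor from the Mutex concept, not the full interface.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical no-op mutex with the scoped_lock shape of the Mutex concept.
// Illustration only; this is not the TBB null_mutex source.
struct noop_mutex {
    struct scoped_lock {
        explicit scoped_lock(noop_mutex&) {}  // "acquire": does nothing
        ~scoped_lock() {}                     // "release": does nothing
    };
};

// Hypothetical adapter that gives std::mutex the same scoped_lock shape.
struct real_mutex {
    std::mutex impl;
    struct scoped_lock {
        explicit scoped_lock(real_mutex& m) : held(m.impl) {}
        std::lock_guard<std::mutex> held;     // unlocks when destroyed
    };
};

// A counter whose locking policy is chosen by the template argument.
template <typename Mutex>
class counter {
public:
    counter() : value_(0) {}
    int increment() {
        typename Mutex::scoped_lock lock(mutex_);  // scoped locking pattern
        return ++value_;
    }
private:
    Mutex mutex_;
    int value_;
};
```

A counter<noop_mutex> suits data that only one thread touches, while counter<real_mutex> pays for real locking; the body of increment() is the same in both cases.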
9.1.10 null_rw_mutex Class

Summary
Class that models ReaderWriterMutex Concept but does nothing.

Syntax
    class null_rw_mutex;

Header
    #include "tbb/null_rw_mutex.h"

Description
A null_rw_mutex models the ReaderWriterMutex Concept (9.1.6) syntactically, but does nothing. It is useful for instantiating a template that expects a ReaderWriterMutex, but no mutual exclusion is actually needed for that instance.

Members
See ReaderWriterMutex concept (9.1.6).

9.2 atomic Template Class

Summary
Template class for atomic operations.

Syntax
    template<typename T> atomic;

Header
    #include "tbb/atomic.h"

Description
An atomic<T> supports atomic read, write, fetch-and-add, fetch-and-store, and compare-and-swap. Type T may be an integral type, enumeration type, or a pointer type. When T is a pointer type, arithmetic operations are interpreted as pointer arithmetic. For example, if x has type atomic<float*> and a float occupies four bytes, then ++x advances x by four bytes. Arithmetic on atomic<T> is not allowed if T is an enumeration type, void*, or bool.

Some of the methods have template method variants that permit more selective memory fencing. On IA-32 and Intel® 64 architecture processors, they have the same effect as the non-templated variants. On IA-64 architecture (Itanium®) processors, they may improve performance by allowing the memory subsystem more latitude on the order of reads and writes. Table 36 shows the fencing for the non-template form.

Table 36: Operation Order Implied by Non-Template Methods

Kind | Description | Default For
acquire | Operations after the atomic operation never move over it. | read
release | Operations before the atomic operation never move over it. | write
sequentially consistent | Operations on either side never move over it and furthermore, the sequentially consistent atomic operations have a global order. | fetch_and_store, fetch_and_add, compare_and_swap

CAUTION: The copy constructor for class atomic<T> is not atomic. To atomically copy an atomic<T>, default-construct the copy first and assign to it. Below is an example that shows the difference.

    atomic<T> y(x);  // Not atomic
    atomic<T> z;
    z=x;             // Atomic assignment

The copy constructor is not atomic because it is compiler generated. Introducing any non-trivial constructors might remove an important property of atomic<T>: namespace scope instances are zero-initialized before namespace scope dynamic initializers run. This property can be essential for code executing early during program startup.

To create an atomic<T> with a specific value, default-construct it first, and afterwards assign a value to it.

Members
    namespace tbb {
        enum memory_semantics {
            acquire,
            release
        };

        template<typename T>
        struct atomic {
            typedef T value_type;

            template<memory_semantics M>
            value_type compare_and_swap( value_type new_value, value_type comparand );

            value_type compare_and_swap( value_type new_value, value_type comparand );

            template<memory_semantics M>
            value_type fetch_and_store( value_type new_value );

            value_type fetch_and_store( value_type new_value );

            operator value_type() const;

            value_type operator=( value_type new_value );

            atomic<T>& operator=( const atomic<T>& value );

            // The following members exist only if T is an integral
            // or pointer type.
            template<memory_semantics M>
            value_type fetch_and_add( value_type addend );

            value_type fetch_and_add( value_type addend );

            template<memory_semantics M>
            value_type fetch_and_increment();

            value_type fetch_and_increment();

            template<memory_semantics M>
            value_type fetch_and_decrement();

            value_type fetch_and_decrement();

            value_type operator+=(value_type);
            value_type operator-=(value_type);
            value_type operator++();
            value_type operator++(int);
            value_type operator--();
            value_type operator--(int);
        };
    }

So that an atomic<T*> can be used like a pointer to T, the specialization atomic<T*> also defines:

    T* operator->() const;

9.2.1 memory_semantics Enum

Description
Defines values used to select the template variants that permit more selective control over visibility of operations (see Table 36).

9.2.2 value_type fetch_and_add( value_type addend )

Effects
Let x be the value of *this. Atomically updates x = x + addend.

Returns
Original value of x.

9.2.3 value_type fetch_and_increment()

Effects
Let x be the value of *this. Atomically updates x = x + 1.

Returns
Original value of x.

9.2.4 value_type fetch_and_decrement()

Effects
Let x be the value of *this. Atomically updates x = x - 1.

Returns
Original value of x.

9.2.5 value_type compare_and_swap( value_type new_value, value_type comparand )

Effects
Let x be the value of *this. Atomically compares x with comparand, and if they are equal, sets x=new_value.

Returns
Original value of x.

9.2.6 value_type fetch_and_store( value_type new_value )

Effects
Let x be the value of *this. Atomically exchanges old value of x with new_value.

Returns
Original value of x.

9.3 PPL Compatibility

Classes critical_section and reader_writer_lock exist for compatibility with the Microsoft Parallel Patterns Library (PPL). They do not follow all of the conventions of other mutexes in Intel® Threading Building Blocks.

9.3.1 critical_section

Summary
A PPL-compatible mutex.
Syntax
    class critical_section;

Header
    #include "tbb/critical_section.h"

Description
A critical_section implements a PPL critical_section. Its functionality is a subset of the functionality of a tbb::mutex.

Members
    namespace tbb {
        class critical_section {
        public:
            critical_section();
            ~critical_section();
            void lock();
            bool try_lock();
            void unlock();

            class scoped_lock {
            public:
                scoped_lock( critical_section& mutex );
                ~scoped_lock();
            };
        };
    }

9.3.2 reader_writer_lock Class

Summary
A PPL-compatible reader-writer mutex that is scalable and gives preference to writers.

Syntax
    class reader_writer_lock;

Header
    #include "tbb/reader_writer_lock.h"

Description
A reader_writer_lock implements a PPL-compatible reader-writer mutex. A reader_writer_lock is scalable and nonrecursive. The implementation handles lock requests on a first-come, first-served basis except that writers have preference over readers. Waiting threads busy wait, which can degrade system performance if the wait is long. However, if the wait is typically short, a reader_writer_lock can provide performance competitive with other mutexes.

A reader_writer_lock models part of the ReaderWriterMutex Concept (9.1.6) and part of the C++ 200x compatibility interface (9.1.1.1). The major differences are:

• The scoped interfaces support only strictly scoped locks. For example, the method scoped_lock::release() is not supported.
• Reader locking has a separate interface. For example, there is a separate scoped interface scoped_lock_read for reader locking, instead of a flag to distinguish the reader cases as in the ReaderWriterMutex Concept.
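The second difference, a distinct scoped type for reader locks rather than a boolean flag, has a close analog in the C++17 standard library, which may help readers picture it. The sketch below uses std::shared_mutex with std::shared_lock and std::unique_lock; it is not the TBB class, the name shared_value is hypothetical, and std::shared_mutex makes no promise about writer preference.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Standard-library analog (C++17) of separate reader and writer scoped locks.
// Hypothetical illustration; not tbb::reader_writer_lock.
class shared_value {
public:
    shared_value() : value_(0) {}

    int read() const {
        // Reader lock: plays the role of scoped_lock_read.
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return value_;
    }

    void write(int v) {
        // Writer lock: plays the role of scoped_lock.
        std::unique_lock<std::shared_mutex> lock(mutex_);
        value_ = v;
    }

private:
    mutable std::shared_mutex mutex_;
    int value_;
};
```

Both locks are strictly scoped here: each is released only when the enclosing function returns, which mirrors the first difference listed above.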
Members
    namespace tbb {
        class reader_writer_lock {
        public:
            reader_writer_lock();
            ~reader_writer_lock();
            void lock();
            void lock_read();
            bool try_lock();
            bool try_lock_read();
            void unlock();

            class scoped_lock {
            public:
                scoped_lock( reader_writer_lock& mutex );
                ~scoped_lock();
            };

            class scoped_lock_read {
            public:
                scoped_lock_read( reader_writer_lock& mutex );
                ~scoped_lock_read();
            };
        };
    }

Table 37 summarizes the semantics.

Table 37: reader_writer_lock Members Summary

Member | Semantics
reader_writer_lock() | Construct unlocked mutex.
~reader_writer_lock() | Destroy unlocked mutex.
void reader_writer_lock::lock() | Acquire write lock on mutex.
void reader_writer_lock::lock_read() | Acquire read lock on mutex.
bool reader_writer_lock::try_lock() | Try to acquire write lock on mutex. Returns true if lock acquired, false otherwise.
bool reader_writer_lock::try_lock_read() | Try to acquire read lock on mutex. Returns true if lock acquired, false otherwise.
reader_writer_lock::unlock() | Release lock.
reader_writer_lock::scoped_lock( reader_writer_lock& m ) | Acquire write lock on mutex m.
reader_writer_lock::~scoped_lock() | Release write lock (if acquired).
reader_writer_lock::scoped_lock_read( reader_writer_lock& m ) | Acquire read lock on mutex m.
reader_writer_lock::~scoped_lock_read() | Release read lock (if acquired).

9.4 C++ 200x Synchronization

Intel® TBB approximates a portion of C++ 200x interfaces for condition variables and scoped locking. The approximation is based on the C++0x working draft N3000. The major differences are:

• The implementation uses the tbb::tick_count interface instead of the C++ 200x <chrono> interface.
• The implementation throws std::runtime_error instead of a C++ 200x std::system_error.
• The implementation omits or approximates features requiring C++ 200x language support such as constexpr or explicit operators.
• The implementation works in conjunction with tbb::mutex wherever the C++ 200x specification calls for a std::mutex. See 9.1.1.1 for more about C++ 200x mutex support in Intel® TBB.

See the working draft N3000 for detailed descriptions of the members.

CAUTION: Implementations may change if the C++ 200x specification changes.

CAUTION: When support for std::system_error becomes available, implementations may throw std::system_error instead of std::runtime_error.

The library defines the C++ 200x interfaces in namespace std, not namespace tbb, as explained in Section 2.4.7.

Header
    #include "tbb/compat/condition_variable"

Members
    namespace std {
        struct defer_lock_t { };
        struct try_to_lock_t { };
        struct adopt_lock_t { };
        const defer_lock_t defer_lock = {};
        const try_to_lock_t try_to_lock = {};
        const adopt_lock_t adopt_lock = {};

        template<typename M>
        class lock_guard {
        public:
            typedef M mutex_type;
            explicit lock_guard(mutex_type& m);
            lock_guard(mutex_type& m, adopt_lock_t);
            ~lock_guard();
        };

        template<typename M>
        class unique_lock: no_copy {
        public:
            typedef M mutex_type;

            unique_lock();
            explicit unique_lock(mutex_type& m);
            unique_lock(mutex_type& m, defer_lock_t);
            unique_lock(mutex_type& m, try_to_lock_t);
            unique_lock(mutex_type& m, adopt_lock_t);
            unique_lock(mutex_type& m, const tick_count::interval_t &i);
            ~unique_lock();

            void lock();
            bool try_lock();
            bool try_lock_for( const tick_count::interval_t &i );
            void unlock();
            void swap(unique_lock& u);
            mutex_type* release();

            bool owns_lock() const;
            operator bool() const;
            mutex_type* mutex() const;
        };

        template<typename M>
        void swap(unique_lock<M>& x, unique_lock<M>& y);

        enum cv_status {no_timeout, timeout};

        class condition_variable : no_copy {
        public:
            condition_variable();
            ~condition_variable();

            void notify_one();
            void notify_all();

            void wait(unique_lock<mutex>& lock);

            template<class Predicate>
            void wait(unique_lock<mutex>& lock, Predicate pred);

            cv_status wait_for(unique_lock<mutex>& lock, const tick_count::interval_t& i);

            template<typename Predicate>
            bool wait_for(unique_lock<mutex>& lock, const tick_count::interval_t &i, Predicate pred);

            typedef implementation-defined native_handle_type;
            native_handle_type native_handle();
        };
    } // namespace std

10 Timing

Parallel programming is about speeding up wall clock time, which is the real time that it takes a program to run. Unfortunately, some of the obvious wall clock timing routines provided by operating systems do not always work reliably across threads, because the hardware thread clocks are not synchronized. The library provides support for timing across threads. The routines are wrappers around operating system services that we have verified as safe to use across threads.

10.1 tick_count Class

Summary
Class for computing wall-clock times.

Syntax
    class tick_count;

Header
    #include "tbb/tick_count.h"

Description
A tick_count is an absolute timestamp. Two tick_count objects may be subtracted to compute a relative time tick_count::interval_t, which can be converted to seconds.

Example
    #include <cstdio>
    #include "tbb/tick_count.h"

    using namespace tbb;

    void Foo() {
        tick_count t0 = tick_count::now();
        ...action being timed...
        tick_count t1 = tick_count::now();
        printf("time for action = %g seconds\n", (t1-t0).seconds() );
    }

Members
    namespace tbb {
        class tick_count {
        public:
            class interval_t;
            static tick_count now();
        };

        tick_count::interval_t operator-( const tick_count& t1, const tick_count& t0 );
    } // tbb

10.1.1 static tick_count tick_count::now()

Returns
Current wall clock timestamp.

CAUTION: On Microsoft Windows* operating systems, the current implementation uses the function QueryPerformanceCounter. Some systems may have bugs in their basic input/output system (BIOS) or hardware abstraction layer (HAL) that cause different processors to return different results.

10.1.2 tick_count::interval_t operator-( const tick_count& t1, const tick_count& t0 )

Returns
Relative time that t1 occurred after t0.
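The timestamp-subtraction pattern above (take now(), run the action, take now() again, convert the difference to seconds) can also be tried without TBB. The following sketch substitutes std::chrono::steady_clock for tick_count; that substitution, and the helper names seconds_taken and sum_to, are assumptions made for illustration and are not part of the library described here.

```cpp
#include <cassert>
#include <chrono>

// Returns the wall-clock duration of f() in seconds, measured with a
// monotonic standard-library clock instead of tbb::tick_count.
template <typename F>
double seconds_taken(F f) {
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    f();  // the action being timed
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    // Subtracting two time points yields a duration, much as subtracting
    // two tick_count values yields an interval_t.
    return std::chrono::duration<double>(t1 - t0).count();
}

// Hypothetical workload used only to have something to time.
inline long sum_to(long n) {
    long s = 0;
    for (long i = 1; i <= n; ++i) s += i;
    return s;
}
```

A call such as seconds_taken([&]{ result = sum_to(1000); }) returns a non-negative number of seconds; the measured value itself varies from run to run.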
10.1.3 tick_count::interval_t Class

Summary
Class for relative wall-clock time.

Syntax
    class tick_count::interval_t;

Header
    #include "tbb/tick_count.h"

Description
A tick_count::interval_t represents relative wall clock duration.

Members
    namespace tbb {
        class tick_count::interval_t {
        public:
            interval_t();
            explicit interval_t( double sec );
            double seconds() const;
            interval_t operator+=( const interval_t& i );
            interval_t operator-=( const interval_t& i );
        };
        tick_count::interval_t operator+( const tick_count::interval_t& i,
                                          const tick_count::interval_t& j );
        tick_count::interval_t operator-( const tick_count::interval_t& i,
                                          const tick_count::interval_t& j );
    } // namespace tbb

10.1.3.1 interval_t()

Effects
Constructs interval_t representing zero time duration.

10.1.3.2 interval_t( double sec )

Effects
Constructs interval_t representing specified number of seconds.

10.1.3.3 double seconds() const

Returns
Time interval measured in seconds.

10.1.3.4 interval_t operator+=( const interval_t& i )

Effects
*this = *this + i

Returns
Reference to *this.

10.1.3.5 interval_t operator-=( const interval_t& i )

Effects
*this = *this - i

Returns
Reference to *this.

10.1.3.6 interval_t operator+ ( const interval_t& i, const interval_t& j )

Returns
interval_t representing sum of intervals i and j.

10.1.3.7 interval_t operator- ( const interval_t& i, const interval_t& j )

Returns
interval_t representing difference of intervals i and j.

11 Task Groups

This chapter covers the high-level interface to the task scheduler. Chapter 12 covers the low-level interface. The high-level interface lets you easily create groups of potentially parallel tasks from functors or lambda expressions. The low-level interface permits more detailed control, such as control over exception propagation and affinity.

Summary
High-level interface for running functions in parallel.
Syntax
    template<typename Func> task_handle;
    template<typename Func> task_handle<Func> make_task( const Func& f );
    enum task_group_status;
    class task_group;
    class structured_task_group;
    bool is_current_task_group_canceling();

Header
    #include "tbb/task_group.h"

Requirements
Functor arguments for various methods in this chapter should meet the requirements in Table 38.

Table 38: Requirements on functor arguments

    Pseudo-Signature                    Semantics
    Func::Func (const Func&)            Copy constructor.
    Func::~Func ()                      Destructor.
    void Func::operator()() const;      Evaluate functor.

11.1 task_group Class

Description
A task_group represents concurrent execution of a group of tasks. Tasks may be dynamically added to the group as it is executing.

Example with Lambda Expressions
    #include "tbb/task_group.h"
    using namespace tbb;

    int Fib(int n) {
        if( n<2 ) {
            return n;
        } else {
            int x, y;
            task_group g;
            g.run([&]{x=Fib(n-1);}); // spawn a task
            g.run([&]{y=Fib(n-2);}); // spawn another task
            g.wait();                // wait for both tasks to complete
            return x+y;
        }
    }

CAUTION: Creating a large number of tasks for a single task_group is not scalable, because task creation becomes a serial bottleneck. If creating more than a small number of concurrent tasks, consider using parallel_for (4.4) or parallel_invoke (4.12) instead, or structure the spawning as a recursive tree.

Members
    namespace tbb {
        class task_group {
        public:
            task_group();
            ~task_group();
            template<typename Func> void run( const Func& f );
            template<typename Func> void run( task_handle<Func>& handle );
            template<typename Func> void run_and_wait( const Func& f );
            template<typename Func> void run_and_wait( task_handle<Func>& handle );
            task_group_status wait();
            bool is_canceling();
            void cancel();
        };
    }

11.1.1 task_group()

Constructs an empty task group.

11.1.2 ~task_group()

Requires
Method wait must be called before destroying a task_group, otherwise the destructor throws an exception.

11.1.3 template<typename Func> void run( const Func& f )

Effects
Spawn a task that computes f() and return immediately.
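The functor requirements in Table 38 can be made concrete with a small class. The following is a minimal sketch: sum_range and run_copy are hypothetical names, and run_copy only imitates, on the calling thread, the three operations the table lists. It assumes nothing about when or whether the library copies a functor.

```cpp
#include <cassert>

// Hypothetical functor that satisfies Table 38: it has a copy constructor
// and destructor (both implicit) and a const operator()().
class sum_range {
public:
    sum_range(int first, int last, long* result)
        : first_(first), last_(last), result_(result) {}

    // The call operator is const: it writes through a pointer,
    // so the functor object itself is not modified.
    void operator()() const {
        long s = 0;
        for (int i = first_; i < last_; ++i) s += i;
        *result_ = s;
    }

private:
    int first_, last_;
    long* result_;
};

// Stand-in that exercises each requirement of Table 38 in turn.
template<typename Func>
void run_copy(const Func& f) {
    Func copy(f);   // Func::Func(const Func&)
    copy();         // void Func::operator()() const
}                   // Func::~Func()
```

With the library available, an object such as sum_range(0, 5, &r) would be passed to run in the same way as the lambda expressions in the Fib example.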
11.1.4 template<typename Func> void run( task_handle<Func>& handle );

Effects
Spawn a task that computes handle() and return immediately.

11.1.5 template<typename Func> void run_and_wait( const Func& f )

Effects
Equivalent to {run(f); wait();}, but guarantees that f runs on the current thread.

NOTE: Template method run_and_wait is intended to be more efficient than separate calls to run and wait.

11.1.6 template<typename Func> void run_and_wait( task_handle<Func>& handle );

Effects
Equivalent to {run(handle); wait();}, but guarantees that handle() runs on the current thread.

NOTE: Template method run_and_wait is intended to be more efficient than separate calls to run and wait.

11.1.7 task_group_status wait()

Effects
Wait for all tasks in the group to complete or be cancelled.

11.1.8 bool is_canceling()

Returns
True if this task group is cancelling its tasks.

11.1.9 void cancel()

Effects
Cancel all tasks in this task_group.

11.2 task_group_status Enum

A task_group_status represents the status of a task_group.

Members
    namespace tbb {
        enum task_group_status {
            not_complete, // Not cancelled and not all tasks in group have completed.
            complete,     // Not cancelled and all tasks in group have completed
            canceled      // Task group received cancellation request
        };
    }

11.3 task_handle Template Class

Summary
Template class used to wrap a function object in conjunction with class structured_task_group.

Description
Class task_handle is used primarily in conjunction with class structured_task_group. For sake of uniformity, class task_group also accepts task_handle arguments.

Members
    template<typename Func>
    class task_handle {
    public:
        task_handle( const Func& f );
        void operator()() const;
    };

11.4 make_task Template Function

Summary
Template function for creating a task_handle from a function or functor.
Syntax
    template<typename Func>
    task_handle<Func> make_task( const Func& f );

Returns
task_handle<Func>(f)

11.5 structured_task_group Class

Description
A structured_task_group is like a task_group, but has only a subset of the functionality. It may permit performance optimizations in the future. The restrictions are:
    o Methods run and run_and_wait take only task_handle arguments, not general functors.
    o Methods run and run_and_wait do not copy their task_handle arguments. The caller must not destroy those arguments until after wait or run_and_wait returns.
    o Methods run, run_and_wait, cancel, and wait should be called only by the thread that created the structured_task_group.
    o Method wait (or run_and_wait) should be called only once on a given instance of structured_task_group.

Example
The function fork_join below evaluates f1() and f2(), in parallel if resources permit.

    #include "tbb/task_group.h"
    using namespace tbb;

    template<typename Func1, typename Func2>
    void fork_join( const Func1& f1, const Func2& f2 ) {
        structured_task_group group;
        task_handle<Func1> h1(f1);
        group.run(h1);   // spawn a task
        task_handle<Func2> h2(f2);
        group.run(h2);   // spawn another task
        group.wait();    // wait for both tasks to complete
        // now safe to destroy h1 and h2
    }

Members
    namespace tbb {
        class structured_task_group {
        public:
            structured_task_group();
            ~structured_task_group();
            template<typename Func> void run( task_handle<Func>& handle );
            template<typename Func> void run_and_wait( task_handle<Func>& handle );
            task_group_status wait();
            bool is_canceling();
            void cancel();
        };
    }

11.6 is_current_task_group_canceling Function

Returns
True if innermost task group executing on this thread is cancelling its tasks.

12 Task Scheduler

Intel Threading Building Blocks (Intel® TBB) provides a task scheduler, which is the engine that drives the algorithm templates (Section 4) and task groups (Section 11). You may also call it directly.
Using tasks is often simpler and more efficient than using threads, because the task scheduler takes care of a lot of details.

The tasks are quanta of computation. The scheduler maps these onto physical threads. The mapping is non-preemptive. Each task has a method execute(). Once a thread starts running execute(), the task is bound to that thread until execute() returns. During that time, the thread services other tasks only when it waits on its predecessor tasks, at which time it may run the predecessor tasks, or if there are no pending predecessor tasks, the thread may service tasks created by other threads.

The task scheduler is intended for parallelizing computationally intensive work. Because task objects are not scheduled preemptively, they should generally avoid making calls that might block for long periods, because meanwhile that thread is precluded from servicing other tasks.

CAUTION: There is no guarantee that potentially parallel tasks actually execute in parallel, because the scheduler adjusts actual parallelism to fit available worker threads. For example, given a single worker thread, the scheduler creates no actual parallelism. As a consequence, it is generally unsafe to use tasks in a producer-consumer relationship, because there is no guarantee that the consumer runs at all while the producer is running.

Potential parallelism is typically generated by a split/join pattern. Two basic patterns of split/join are supported. The most efficient is continuation-passing form, in which the programmer constructs an explicit “continuation” task. The parent task creates child tasks and specifies a continuation task to be executed when the children complete. The continuation inherits the parent’s ancestor. The parent task then exits; it does not block on its children. The children subsequently run, and after they (or their continuations) finish, the continuation task starts running. Figure 7 shows the steps. The running tasks at each step are shaded.
Figure 7: Continuation-passing Style

Explicit continuation passing is efficient, because it decouples the thread’s stack from the tasks. However, it is more difficult to program. A second pattern is "blocking style", which uses implicit continuations. It is sometimes less efficient in performance, but more convenient to program. In this pattern, the parent task blocks until its children complete, as shown in Figure 8.

Figure 8: Blocking Style

The convenience comes with a price. Because the parent blocks, its thread’s stack cannot be popped yet. The thread must be careful about what work it takes on, because continually stealing and blocking could cause the stack to grow without bound. To solve this problem, the scheduler constrains a blocked thread such that it never executes a task that is less deep than its deepest blocked task. This constraint may impact performance because it limits available parallelism, and tends to cause threads to select smaller (deeper) subtrees than they would otherwise choose.

12.1 Scheduling Algorithm

The scheduler employs a technique known as work stealing. Each thread keeps a "ready pool" of tasks that are ready to run. The ready pool is structured as a deque (double-ended queue) of task objects that were spawned. Additionally, there is a shared queue of task objects that were enqueued. The distinction between spawning a task and enqueuing a task affects when the scheduler runs the task.

After completing a task t, a thread chooses its next task according to the first applicable rule below:
    1. The task returned by t.execute()
    2. The successor of t if t was its last completed predecessor.
    3. A task popped from the end of the thread’s own deque.
    4. A task with affinity for the thread.
    5. A task popped from approximately the beginning of the shared queue.
    6. A task popped from the beginning of another randomly chosen thread’s deque.

When a thread spawns a task, it pushes it onto the end of its own deque. Hence rule (3) above gets the task most recently spawned by the thread, whereas rule (6) gets the least recently spawned task of another thread.

When a thread enqueues a task, it pushes it onto the end of the shared queue. Hence rule (5) gets one of the less recently enqueued tasks, and has no preference for tasks that the thread itself enqueued. This is in contrast to spawned tasks, where by rule (3) a thread prefers its own most recently spawned task.

Note the “approximately” in rule (5). For scalability reasons, the shared queue does not guarantee precise first-in first-out behavior. If strict first-in first-out behavior is desired, put the real work in a separate queue, and create tasks that pull work from that queue. The chapter “Non-Preemptive Priorities” in the Intel® TBB Design Patterns manual explains the technique.

It is important to understand the implications of spawning versus enqueuing for nested parallelism.
• Spawned tasks emphasize locality. Enqueued tasks emphasize fairness.
• For nested parallelism, spawned tasks tend towards depth-first execution, whereas enqueued tasks cause breadth-first execution. Because the space demands of breadth-first execution can be exponentially higher than depth-first execution, enqueued tasks should be used with care.
• A spawned task might never be executed until a thread explicitly waits on the task to complete. An enqueued task will eventually run if all previously enqueued tasks complete. In the case where there would ordinarily be no other worker thread to execute an enqueued task, the scheduler creates an extra worker.

In general, use spawned tasks unless there is a clear reason to use an enqueued task. Spawned tasks yield the best balance between locality of reference, space efficiency, and parallelism.
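The ordering consequences of rules (3), (5), and (6) can be seen in a toy model. The following is a minimal single-threaded sketch, not the scheduler's implementation: ready_pool_model and next_task are hypothetical names, tasks are represented by strings, rules (1), (2), and (4) are left out, and the shared queue is modeled as an exact first-in first-out queue even though the real one is only approximately so.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy model of one thread's ready pool (hypothetical, not TBB API).
class ready_pool_model {
public:
    // Spawning pushes onto the end of the owner's deque.
    void spawn(const std::string& task) { deque_.push_back(task); }

    // Rule (3): the owner pops from the end (most recently spawned).
    std::string pop_own() {
        std::string t = deque_.back();
        deque_.pop_back();
        return t;
    }

    // Rule (6): a thief pops from the beginning (least recently spawned).
    std::string steal() {
        std::string t = deque_.front();
        deque_.pop_front();
        return t;
    }

    bool empty() const { return deque_.empty(); }

private:
    std::deque<std::string> deque_;
};

// Chooses the next task for the thread owning `self`, trying rule (3),
// then rule (5), then rule (6). Returns an empty string if no work is left.
std::string next_task(ready_pool_model& self,
                      std::deque<std::string>& shared_queue,
                      ready_pool_model& victim) {
    if (!self.empty()) return self.pop_own();          // rule (3)
    if (!shared_queue.empty()) {                       // rule (5)
        std::string t = shared_queue.front();
        shared_queue.pop_front();
        return t;
    }
    if (!victim.empty()) return victim.steal();        // rule (6)
    return std::string();
}
```

In this model a thread that spawned a1 and then a2 runs a2 first, which is the depth-first tendency described above, and only turns to enqueued or stolen work once its own deque is empty.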
The algorithm for spawned tasks is similar to the work-stealing algorithm used by Cilk (Blumofe 1995). The notion of work-stealing dates back to the 1980s (Burton 1981). The thread affinity support is more recent (Acar 2000).

12.2 task_scheduler_init Class

Summary
Class that explicitly represents a thread's interest in task scheduling services.

Syntax
    class task_scheduler_init;

Header
    #include "tbb/task_scheduler_init.h"

Description
Using task_scheduler_init is optional in Intel® TBB 2.2. By default, Intel® TBB 2.2 automatically creates a task scheduler the first time that a thread uses task scheduling services and destroys it when the last such thread exits.

An instance of task_scheduler_init can be used to control the following aspects of the task scheduler:
• When the task scheduler is constructed and destroyed.
• The number of threads used by the task scheduler.
• The stack size for worker threads.

To override the automatic defaults for task scheduling, a task_scheduler_init must become active before the first use of task scheduling services.

A task_scheduler_init is either "active" or "inactive". The default constructor for a task_scheduler_init activates it, and the destructor deactivates it. To defer activation, pass the value task_scheduler_init::deferred to the constructor. Such a task_scheduler_init may be activated later by calling method initialize. Destruction of an active task_scheduler_init implicitly deactivates it. To deactivate it earlier, call method terminate.

An optional parameter to the constructor and method initialize allows you to specify the number of threads to be used for task execution. This parameter is useful for scaling studies during development, but should not be set for production use.
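The active/inactive life cycle described above can be sketched as a small state machine. This is a minimal model under stated assumptions: scheduler_init_model is a hypothetical class, not the library's, the numeric values chosen for automatic and deferred are placeholders (the manual calls the real values implementation-defined), and the thread-count argument is accepted but ignored because only the state transitions are modeled.

```cpp
#include <cassert>

// Hypothetical model of the active/inactive life cycle (not the TBB class).
class scheduler_init_model {
public:
    static const int automatic = -1;  // placeholder value
    static const int deferred  = -2;  // placeholder value

    // The constructor activates the object unless activation is deferred.
    explicit scheduler_init_model(int max_threads = automatic)
        : active_(false) {
        if (max_threads != deferred) initialize(max_threads);
    }

    // Destroying an active object implicitly deactivates it.
    ~scheduler_init_model() {
        if (active_) terminate();
    }

    // Activates an object that is currently inactive.
    void initialize(int /*max_threads*/ = automatic) {
        assert(!active_);
        active_ = true;
    }

    // Deactivates an active object without destroying it.
    void terminate() {
        assert(active_);
        active_ = false;
    }

    bool is_active() const { return active_; }

private:
    bool active_;
};
```

A deferred object therefore goes through the sequence inactive, initialize, active, terminate, inactive, while a default-constructed one is active for its whole lifetime unless terminate is called early.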
TIP: The reason for not specifying the number of threads in production code is that in a large software project, there is no way for various components to know how many threads would be optimal for other components. Hardware threads are a shared global resource. It is best to leave the decision of how many threads to use to the task scheduler.

To minimize time overhead, it is best to rely upon automatic creation of the task scheduler, or create a single task_scheduler_init object whose activation spans all uses of the library's task scheduler. A task_scheduler_init is not assignable or copy-constructible.

Example
    // Sketch of one way to do a scaling study
    #include <iostream>
    #include "tbb/task_scheduler_init.h"
    #include "tbb/tick_count.h"
    using namespace tbb;
    using namespace std;

    int main() {
        int n = task_scheduler_init::default_num_threads();
        for( int p=1; p<=n; ++p ) {
            // Construct task scheduler with p threads
            task_scheduler_init init(p);
            tick_count t0 = tick_count::now();
            ... execute parallel algorithm using task or template algorithm here...
            tick_count t1 = tick_count::now();
            double t = (t1-t0).seconds();
            cout << "time = " << t << " with " << p << " threads\n";
            // Implicitly destroy task scheduler.
        }
        return 0;
    }

Members
    namespace tbb {
        typedef unsigned-integral-type stack_size_type;
        class task_scheduler_init {
        public:
            static const int automatic = implementation-defined;
            static const int deferred = implementation-defined;
            task_scheduler_init( int max_threads=automatic,
                                 stack_size_type thread_stack_size=0 );
            ~task_scheduler_init();
            void initialize( int max_threads=automatic );
            void terminate();
            static int default_num_threads();
            bool is_active() const;
        };
    } // namespace tbb

12.2.1 task_scheduler_init( int max_threads=automatic, stack_size_type thread_stack_size=0 )

Requirements
The value max_threads shall be one of the values in Table 39.

Effects
If max_threads==task_scheduler_init::deferred, nothing happens, and the task_scheduler_init remains inactive.
Otherwise, the task_scheduler_init is activated as follows. If the thread has no other active task_scheduler_init objects, the thread allocates internal thread-specific resources required for scheduling task objects. If there were no threads with active task_scheduler_init objects yet, then internal worker threads are created as described in Table 39. These workers sleep until needed by the task scheduler. Each worker created by the scheduler has an implicit active task_scheduler_init object.

NOTE: As of TBB 3.0, it is meaningful for the parameter max_threads to differ for different calling threads. For example, if thread A specifies max_threads=3 and thread B specifies max_threads=7, then A is limited to having 2 workers, but B can have up to 6 workers. Since workers may be shared between A and B, the total number of worker threads created by the scheduler could be 6.

NOTE: Some implementations create more workers than necessary. However, the excess workers remain asleep unless needed.

The optional parameter thread_stack_size specifies the stack size of each worker thread. A value of 0 specifies use of a default stack size. The first active task_scheduler_init establishes the stack size for all worker threads.

Table 39: Values for max_threads

    max_threads                       Semantics
    task_scheduler_init::automatic    Let library determine max_threads based on hardware configuration.
    task_scheduler_init::deferred     Defer activation actions.
    positive integer                  Request that up to max_threads-1 worker threads work on behalf of the calling thread at any one time.

12.2.2 ~task_scheduler_init()

Effects
If the task_scheduler_init is inactive, nothing happens. Otherwise, the task_scheduler_init is deactivated as follows. If the thread has no other active task_scheduler_init objects, the thread deallocates internal thread-specific resources required for scheduling task objects.
If no existing thread has any active task_scheduler_init objects, then the internal worker threads are terminated.

12.2.3 void initialize( int max_threads=automatic )

Requirements
The task_scheduler_init shall be inactive.

Effects
Similar to constructor (12.2.1).

12.2.4 void terminate()

Requirements
The task_scheduler_init shall be active.

Effects
Deactivates the task_scheduler_init without destroying it. The description of the destructor (12.2.2) specifies what deactivation entails.

12.2.5 int default_num_threads()

Returns
One more than the number of worker threads that task_scheduler_init creates by default.

12.2.6 bool is_active() const

Returns
True if *this is active as described in Section 12.2; false otherwise.

12.2.7 Mixing with OpenMP

Mixing OpenMP with Intel® Threading Building Blocks is supported. Performance may be less than a pure OpenMP or pure Intel® Threading Building Blocks solution if the two forms of parallelism are nested.

An OpenMP parallel region that plans to use the task scheduler should create a task_scheduler_init inside the parallel region, because the parallel region may create new threads unknown to Intel® Threading Building Blocks. Each of these new OpenMP threads, like native threads, must create a task_scheduler_init object before using Intel® Threading Building Blocks algorithms. The following example demonstrates how to do this.

    void OpenMP_Calls_TBB( int n ) {
    #pragma omp parallel
        {
            task_scheduler_init init;
    #pragma omp for
            for( int i=0; irefcount, and if becomes zero, puts the successor into the ready pool.
        c. Frees the memory of the task for reuse.
    3. If the task has been marked for recycling:
        a. If marked by recycle_to_reexecute (deprecated), puts the task back into the ready pool.
        b. Otherwise it was marked by recycle_as_child or recycle_as_continuation.
12.3.2 task Allocation

Always allocate memory for task objects using one of the special overloaded new operators. The allocation methods do not construct the task. Instead, they return a proxy object that can be used as an argument to an overloaded version of operator new provided by the library.

In general, the allocation methods must be called before any of the tasks allocated are spawned. The exception to this rule is allocate_additional_child_of(t), which can be called even if task t is already running. The proxy types are defined by the implementation. The only guarantee is that the phrase “new(proxy) T(...)” allocates and constructs a task of type T. Because these methods are used idiomatically, the headings in the subsection show the idiom, not the declaration. The argument this is typically implicit, but shown explicitly in the headings to distinguish instance methods from static methods.

TIP: Allocating tasks larger than 216 bytes might be significantly slower than allocating smaller tasks. In general, task objects should be small lightweight entities.

12.3.2.1 new( task::allocate_root( task_group_context& group ) ) T

Allocate a task of type T with the specified cancellation group. Figure 10 summarizes the state transition.

Figure 10: Effect of task::allocate_root()

Use method spawn_root_and_wait (12.3.5.9) to execute the task.

12.3.2.2 new( task::allocate_root() ) T

Like new(task::allocate_root(task_group_context&)) except that cancellation group is the current innermost cancellation group.

12.3.2.3 new( x.allocate_continuation() ) T

Allocates and constructs a task of type T, and transfers the successor from x to the new task. No reference counts change. Figure 11 summarizes the state transition.
Figure 11: Effect of allocate_continuation()

12.3.2.4 new( x.allocate_child() ) T

Effects
Allocates a task with this as its successor. Figure 12 summarizes the state transition.

Figure 12: Effect of allocate_child()

If using explicit continuation passing, then the continuation, not the successor, should call the allocation method, so that successor is set correctly.

If the number of tasks is not a small fixed number, consider building a task_list (12.5) of the predecessors first, and spawning them with a single call to task::spawn (12.3.5.5). If a task must spawn some predecessors before all are constructed, it should use task::allocate_additional_child_of(*this) instead, because that method atomically increments refcount, so that the additional predecessor is properly accounted. However, if doing so, the task must protect against premature zeroing of refcount by using a blocking-style task pattern.

12.3.2.5 new(task::allocate_additional_child_of( y )) T

Effects
Allocates a task as a predecessor of another task y. Task y may be already running or have other predecessors running. Figure 13 summarizes the state transition.

Figure 13: Effect of allocate_additional_child_of(successor)

Because y may already have running predecessors, the increment of y.refcount is atomic (unlike the other allocation methods, where the increment is not atomic). When adding a predecessor to a task with other predecessors running, it is up to the programmer to ensure that the successor’s refcount does not prematurely reach 0 and trigger execution of the successor before the new predecessor is added.

12.3.3 Explicit task Destruction

Usually, a task is automatically destroyed by the scheduler after its method execute returns.
But sometimes task objects are used idiomatically (such as for reference counting) without ever running execute. Such tasks should be disposed with method destroy.

12.3.3.1 static void destroy ( task& victim )

Requirements
The refcount of victim must be zero. This requirement is checked in the debug version of the library.

Effects
Calls destructor and deallocates memory for victim. If victim.parent is not null, atomically decrements victim.parent->refcount. The parent is not put into the ready pool if its refcount becomes zero. Figure 14 summarizes the state transition.

Figure 14: Effect of destroy(victim). (The successor can be null.)

12.3.4 Recycling Tasks

It is often more efficient to recycle a task object rather than reallocate one from scratch. Often the parent can become the continuation, or one of the predecessors.

CAUTION: Overlap rule: A recycled task t must not be put in jeopardy of having t.execute() rerun while the previous invocation of t.execute() is still running. The debug version of the library detects some violations of this rule.

For example, t.execute() should never spawn t directly after recycling it. Instead, t.execute() should return a pointer to t, so that t is spawned after t.execute() completes.

12.3.4.1 void recycle_as_continuation()

Requirements
Must be called while method execute() is running.

The refcount for the recycled task should be set to n, where n is the number of predecessors of the continuation task.

CAUTION: The caller must guarantee that the task’s refcount does not become zero until after method execute() returns, otherwise the overlap rule is broken. If the guarantee is not possible, use method recycle_as_safe_continuation() instead, and set the refcount to n+1.

The race can occur for a task t when:
• t.execute() recycles t as a continuation.
• The continuation has predecessors that all complete before t.execute() returns.

Hence the recycled t will be implicitly respawned with the original t.execute() still running, which breaks the overlap rule.

Patterns that use recycle_as_continuation() typically avoid the race by making t.execute() return a pointer to one of the predecessors instead of explicitly spawning that predecessor. The scheduler implicitly spawns that predecessor after t.execute() returns, thus guaranteeing that the recycled t does not rerun prematurely.

Effects
Causes this to not be destroyed when method execute() returns.

12.3.4.2 void recycle_as_safe_continuation()

Requirements
Must be called while method execute() is running.

The refcount for the recycled task should be set to n+1, where n is the number of predecessors of the continuation task. The additional +1 represents the task to be recycled.

Effects
Causes this to not be destroyed when method execute() returns.

This method avoids the race discussed for recycle_as_continuation because the additional +1 in the refcount prevents the continuation from executing until the original invocation of execute() completes.

12.3.4.3 void recycle_as_child_of( task& new_successor )

Requirements
Must be called while method execute() is running.

Effects
Causes this to become a predecessor of new_successor, and not be destroyed when method execute() returns.

12.3.5 Synchronization

Spawning a task t either causes the calling thread to invoke t.execute(), or causes t to be put into the ready pool. Any thread participating in task scheduling may then acquire the task and invoke t.execute(). Section 12.1 describes the structure of the ready pool.

The calls that spawn come in two forms:
• Spawn a single task.
• Spawn multiple task objects specified by a task_list and clear task_list.

Some calls distinguish between spawning root tasks and non-root tasks.
A root task is one that was created using method allocate_root.

Important
A task should not spawn any predecessor task until it has called method set_ref_count to indicate both the number of predecessors and whether it intends to use one of the “wait_for_all” methods.

12.3.5.1 void set_ref_count( int count )

Requirements
count>=0. [26] If the intent is to subsequently spawn n predecessors and wait, then count should be n+1. Otherwise count should be n.

Effects
Sets the refcount attribute to count.

[26] Intel® TBB 2.1 had the stronger requirement count>0.

12.3.5.2 void increment_ref_count();

Effects
Atomically increments refcount attribute.

12.3.5.3 int decrement_ref_count();

Effects
Atomically decrements refcount attribute.

Returns
New value of refcount attribute.

NOTE: Explicit use of increment_ref_count and decrement_ref_count is typically necessary only when a task has more than one immediate successor task. Section 11.6 of the Tutorial ("General Acyclic Graphs of Tasks") explains more.

12.3.5.4 void wait_for_all()

Requirements
refcount=n+1, where n is the number of predecessors that are still running.

Effects
Executes tasks in ready pool until refcount is 1. Afterwards, leaves refcount=1 if the task’s task_group_context specifies concurrent_wait, otherwise sets refcount to 0. [27] Figure 15 summarizes the state transitions.

Also, wait_for_all() automatically resets the cancellation state of the task_group_context implicitly associated with the task (12.6), when all of the following conditions hold:
• The task was allocated without specifying a context.
• The calling thread is a user-created thread, not an Intel® TBB worker thread.
• It is the outermost call to wait_for_all() by the thread.

TIP: Under such conditions there is no way to know afterwards if the task_group_context was cancelled. Use an explicit task_group_context if you need to know.
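The reference-count protocol for blocking-style waits can be traced in a toy model. The following is a minimal single-threaded sketch, not library code: task_model, complete, wait_for_all_model, and spawn_n_and_wait are hypothetical names, completion of a predecessor is reduced to a plain decrement, and only the default case (no concurrent_wait) is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Toy task with only the attributes the protocol needs (hypothetical).
struct task_model {
    int refcount;
    task_model* successor;
    bool done;
    task_model() : refcount(0), successor(0), done(false) {}
};

// Completing a predecessor decrements its successor's refcount.
void complete(task_model& t) {
    t.done = true;
    if (t.successor) --t.successor->refcount;
}

// Models wait_for_all in the default case: run ready tasks until
// refcount is 1, then set refcount to 0.
void wait_for_all_model(task_model& parent, std::deque<task_model*>& ready_pool) {
    while (parent.refcount > 1 && !ready_pool.empty()) {
        task_model* t = ready_pool.back();
        ready_pool.pop_back();
        complete(*t);
    }
    assert(parent.refcount == 1);
    parent.refcount = 0;
}

// Spawns n predecessors of a parent and waits for them.
// Returns (number of predecessors that ran, final refcount of the parent).
std::pair<int, int> spawn_n_and_wait(std::size_t n) {
    task_model parent;
    std::vector<task_model> children(n);
    std::deque<task_model*> ready_pool;

    // n predecessors plus one for the wait, as required before spawning.
    parent.refcount = static_cast<int>(n) + 1;
    for (std::size_t i = 0; i < n; ++i) {
        children[i].successor = &parent;
        ready_pool.push_back(&children[i]);
    }
    wait_for_all_model(parent, ready_pool);

    int ran = 0;
    for (std::size_t i = 0; i < n; ++i) if (children[i].done) ++ran;
    return std::make_pair(ran, parent.refcount);
}
```

The extra 1 in the count is what keeps the parent's refcount from reaching zero while it is still waiting; in the model it is the value the loop stops at before the count is reset to 0.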
[27] For sake of backwards compatibility, the default for task_group_context is not concurrent_wait, and hence the default is to set refcount=0.

Figure 15: Effect of wait_for_all (the refcount of this goes from n+1 to k, where n is the number of previously spawned predecessors that are still running; k = 0 by default, and k = 1 if the corresponding task_group_context specifies concurrent_wait)

12.3.5.5 static void spawn( task& t )

Effects
Puts task t into the ready pool and immediately returns.

If the successor of t is not null, then set_ref_count must be called on that successor before spawning any child tasks, because once the child tasks commence, their completion will cause successor.refcount to be decremented asynchronously. The debug version of the library often detects when a required call to set_ref_count is not made, or is made too late.

12.3.5.6 static void spawn ( task_list& list )

Effects
Equivalent to executing spawn on each task in list and clearing list, but may be more efficient. If list is empty, there is no effect.

NOTE: Spawning a long linear list of tasks can introduce a bottleneck, because tasks are stolen individually. Instead, consider using a recursive pattern or a parallel loop template to create many pieces of independent work.

12.3.5.7 void spawn_and_wait_for_all( task& t )

Requirements
Any other predecessors of this must already be spawned. The task t must have a non-null attribute successor. There must be a chain of successor links from t to the calling task. Typically, this chain contains a single link. That is, t is typically an immediate predecessor of this.

Effects
Similar to {spawn(t); wait_for_all();}, but often more efficient. Furthermore, it guarantees that task t is executed by the current thread. This constraint can sometimes simplify synchronization. Figure 16 illustrates the state transitions. It is similar to Figure 15, with task t being the nth task.
[Figure 16: Effect of spawn_and_wait_for_all. The refcount of this goes from n+1, with task t and n-1 previously spawned predecessors still running, to k, where k = 0 by default and k = 1 if the corresponding task_group_context specifies concurrent_wait.]

12.3.5.8 void spawn_and_wait_for_all( task_list& list )

Effects
Similar to {spawn(list); wait_for_all();}, but often more efficient.

12.3.5.9 static void spawn_root_and_wait( task& root )

Requirements
The memory for task root was allocated by task::allocate_root().

Effects
Sets parent attribute of root to an undefined value and executes root as described in Section 12.3.1.1. Destroys root afterwards unless root was recycled.

12.3.5.10 static void spawn_root_and_wait( task_list& root_list )

Requirements
Each task object t in root_list must meet the requirements in Section 12.3.5.9.

Effects
For each task object t in root_list, performs spawn_root_and_wait(t), possibly in parallel. Section 12.3.5.9 describes the actions of spawn_root_and_wait(t).

12.3.5.11 static void enqueue( task& )

Effects
The task is scheduled for eventual execution by a worker thread even if no thread ever explicitly waits for the task to complete. If the total number of worker threads is zero, a special additional worker thread is created to execute enqueued tasks.

Enqueued tasks are processed in roughly, but not precisely, first-come, first-served order.

CAUTION: Using enqueued tasks for recursive parallelism can cause high memory usage, because the recursion will expand in a breadth-first manner. Use ordinary spawning for recursive parallelism.

CAUTION: Explicitly waiting on an enqueued task should be avoided, because other enqueued tasks from unrelated parts of the program might have to be processed first. The recommended pattern for using an enqueued task is to have it asynchronously signal its completion, for example, by posting a message back to the thread that enqueued it.
See the Intel® Threading Building Blocks Design Patterns manual for such an example.

12.3.6 task Context

These methods expose relationships between task objects, and between task objects and the underlying physical threads.

12.3.6.1 static task& self()

Returns
Reference to innermost task that the calling thread is running. A task is considered “running” if its methods execute(), note_affinity(), or destructor are running. If the calling thread is a user-created thread that is not running any task, self() returns a reference to an implicit dummy task associated with the thread.

12.3.6.2 task* parent() const

Returns
Value of the attribute successor. The result is an undefined value if the task was allocated by allocate_root and is currently running under control of spawn_root_and_wait.

12.3.6.3 void set_parent( task* p )

Requirements
Both tasks must be in the same task group. For example, for task t, t.group() == p->group().

Effects
Sets parent task pointer to specified value p.

12.3.6.4 bool is_stolen_task() const

Returns
true if task is running on a thread different than the thread that spawned it.

NOTE: Tasks enqueued with task::enqueue() are never reported as stolen.

12.3.6.5 task_group_context* group()

Returns
Descriptor of the task group to which this task belongs.

12.3.6.6 void change_group( task_group_context& ctx )

Effects
Moves the task from its current task group into the one specified by the ctx argument.

12.3.7 Cancellation

A task is a quantum of work that is cancelled or executes to completion. A cancelled task skips its method execute() if that method has not yet started. Otherwise cancellation has no direct effect on the task. A task can poll task::is_cancelled() to see if cancellation was requested after it started running. Tasks are cancelled in groups as explained in Section 12.6.

12.3.7.1 bool cancel_group_execution()

Effects
Requests cancellation of all tasks in its group and its subordinate groups.
Returns
False if the task’s group already received a cancellation request; true otherwise.

12.3.7.2 bool is_cancelled() const

Returns
True if task’s group has received a cancellation request; false otherwise.

12.3.8 Priorities

Priority levels can be assigned to individual tasks or task groups. The library supports three levels {low, normal, high} and two kinds of priority:
- Static priority for enqueued tasks.
- Dynamic priority for task groups.

The former is specified by an optional argument of the task::enqueue() method, affects a specific task only, and cannot be changed afterwards. Tasks with higher priority are dequeued before tasks with lower priorities.

The latter affects all the tasks in a group and can be changed at any time either via the associated task_group_context object or via any task belonging to the group. The priority-related methods in task_group_context are described in Section 12.6.

The task scheduler tracks the highest priority of ready tasks (both enqueued and spawned), and postpones execution of tasks with lower priority until all higher priority tasks are executed. By default all tasks and task groups are created with normal priority.

NOTE: Priority changes may not come into effect immediately in all threads. So it is possible that lower priority tasks are still being executed for some time even in the presence of higher priority ones.

When several user threads (masters) concurrently execute parallel algorithms, the pool of worker threads is partitioned between them proportionally to the requested concurrency levels. In the presence of tasks with different priorities, the pool of worker threads is proportionally divided among the masters with the highest priority first. Only after the requests of these higher priority masters are fully satisfied are the remaining threads provided to the other masters.
Though masters with lower priority tasks may be left without workers, the master threads are never stalled themselves. Task priorities also do not affect and are not affected by OS thread priority settings.

NOTE: Worker thread migration from one master thread to another may not happen immediately.

Related constants and methods

    namespace tbb {
        enum priority_t {
            priority_normal = implementation-defined,
            priority_low = implementation-defined,
            priority_high = implementation-defined
        };
        class task {
            // . . .
            static void enqueue( task&, priority_t );
            void set_group_priority ( priority_t );
            priority_t group_priority () const;
            // . . .
        };
    }

12.3.8.1 static void enqueue( task& t, priority_t p )

Effects
Enqueues task t at the priority level p.

NOTE: Priority of an enqueued task does not affect priority of the task group from the scope of which task::enqueue() is invoked (that is, the group to which the task returned by the task::self() method belongs).

12.3.8.2 void set_group_priority ( priority_t )

Effects
Changes priority of the task group to which this task belongs.

12.3.8.3 priority_t group_priority () const

Returns
Priority of the task group to which this task belongs.

12.3.9 Affinity

These methods enable optimizing for cache affinity. They enable you to hint that a later task should run on the same thread as another task that was executed earlier. To do this:
1. In the earlier task, override note_affinity(id) with a definition that records id.
2. Before spawning the later task, run set_affinity(id) using the id recorded in step 1.

The id is a hint and may be ignored by the scheduler.

12.3.9.1 affinity_id

The type task::affinity_id is an implementation-defined unsigned integral type. A value of 0 indicates no affinity. Other values represent affinity to a particular thread. Do not assume anything about non-zero values. The mapping of non-zero values to threads is internal to the Intel® TBB implementation.
12.3.9.2 virtual void note_affinity ( affinity_id id )

The task scheduler invokes note_affinity before invoking execute() when:
• The task has no affinity, but will execute on a thread different than the one that spawned it.
• The task has affinity, but will execute on a thread different than the one specified by the affinity.

You can override this method to record the id, so that it can be used as the argument to set_affinity(id) for a later task.

Effects
The default definition has no effect.

12.3.9.3 void set_affinity( affinity_id id )

Effects
Sets affinity of this task to id. The id should be either 0 or obtained from note_affinity.

12.3.9.4 affinity_id affinity() const

Returns
Affinity of this task as set by set_affinity.

12.3.10 task Debugging

Methods in this subsection are useful for debugging. They may change in future implementations.

12.3.10.1 state_type state() const

CAUTION: This method is intended for debugging only. Its behavior or performance may change in future implementations. The definition of task::state_type may change in future implementations. This information is being provided because it can be useful for diagnosing problems during debugging.

Returns
Current state of the task. Table 41 describes valid states. Any other value is the result of memory corruption, such as using a task whose memory has been deallocated.

Table 41: Values Returned by task::state()

    Value       Description
    allocated   Task is freshly allocated or recycled.
    ready       Task is in ready pool, or is in process of being transferred to/from there.
    executing   Task is running, and will be destroyed after method execute() returns.
    freed       Task is on internal free list, or is in process of being transferred to/from there.
    reexecute   Task is running, and will be respawned after method execute() returns.

Figure 17 summarizes possible state transitions for a task.
[Figure 17: Typical task::state() Transitions. States: freed, allocated, ready, executing, reexecute. Transitions shown: allocate_...(t), spawn(t), spawn_and_wait_for_all(t), return from t.execute(), t.recycle_to_reexecute, t.recycle_as..., destroy(t), and the implicit transfers of storage from the heap and back to the heap.]

12.3.10.2 int ref_count() const

CAUTION: This method is intended for debugging only. Its behavior or performance may change in future implementations.

Returns
The value of the attribute refcount.

12.4 empty_task Class

Summary
Subclass of task that represents doing nothing.

Syntax
    class empty_task;

Header
    #include "tbb/task.h"

Description
An empty_task is a task that does nothing. It is useful as a continuation of a parent task when the continuation should do nothing except wait for its predecessors to complete.

Members
    namespace tbb {
        class empty_task: public task {
            /*override*/ task* execute() {return NULL;}
        };
    }

12.5 task_list Class

Summary
List of task objects.

Syntax
    class task_list;

Header
    #include "tbb/task.h"

Description
A task_list is a list of references to task objects. The purpose of task_list is to allow a task to create a list of tasks and spawn them all at once via the method task::spawn(task_list&), as described in 12.3.5.6.

A task can belong to at most one task_list at a time, and on that task_list at most once. A task that has been spawned, but not started running, must not belong to a task_list. A task_list cannot be copy-constructed or assigned.

Members
    namespace tbb {
        class task_list {
        public:
            task_list();
            ~task_list();
            bool empty() const;
            void push_back( task& task );
            task& pop_front();
            void clear();
        };
    }

12.5.1 task_list()

Effects
Constructs an empty list.

12.5.2 ~task_list()

Effects
Destroys the list. Does not destroy the task objects.

12.5.3 bool empty() const

Returns
True if list is empty; false otherwise.
12.5.4 void push_back( task& task )

Effects
Inserts a reference to task at back of the list.

12.5.5 task& pop_front()

Effects
Removes a task reference from front of list.

Returns
The reference that was removed.

12.5.6 void clear()

Effects
Removes all task references from the list. Does not destroy the task objects.

12.6 task_group_context

Summary
A cancellable group of tasks.

Syntax
    class task_group_context;

Header
    #include "tbb/task.h"

Description
A task_group_context represents a group of tasks that can be cancelled or have their priority level set together. All tasks belong to some group. A task can be a member of only one group at any moment.

A root task is associated with a group by passing a task_group_context object into the task::allocate_root() call. A child task automatically joins its parent task’s group. A task can be moved into another group using the task::change_group() method.

The task_group_context objects form a forest of trees. Each tree’s root is a task_group_context constructed as isolated.

A task_group_context is cancelled explicitly by request, or implicitly when an exception is thrown out of a task. Cancelling a task_group_context causes the entire subtree rooted at it to be cancelled.

The priorities for all the tasks in a group can be changed at any time either via the associated task_group_context object, or via any task belonging to the group. Priority changes propagate into the child task groups similarly to cancellation. The effect of priorities on task execution is described in Section 12.3.8.

Each user thread that creates a task_scheduler_init (12.2) implicitly has an isolated task_group_context that acts as the root of its initial tree. This context is associated with the dummy task returned by task::self() when the user thread is not running any task (12.3.6.1).
Members
    namespace tbb {
        class task_group_context {
        public:
            enum kind_t {
                isolated = implementation-defined,
                bound = implementation-defined
            };
            enum traits_type {
                exact_exception = implementation-defined,
                concurrent_wait = implementation-defined,
    #if TBB_USE_CAPTURED_EXCEPTION
                default_traits = 0
    #else
                default_traits = exact_exception
    #endif /* !TBB_USE_CAPTURED_EXCEPTION */
            };
            task_group_context( kind_t relation_with_parent = bound,
                                uintptr_t traits = default_traits );
            ~task_group_context();
            void reset();
            bool cancel_group_execution();
            bool is_group_execution_cancelled() const;
            void set_priority ( priority_t );
            priority_t priority () const;
        };
    }

12.6.1 task_group_context( kind_t relation_to_parent=bound, uintptr_t traits=default_traits )

Effects
Constructs an empty task_group_context. If relation_to_parent is bound, the task_group_context will become a child of the innermost running task’s group when it is first passed into the call to task::allocate_root(task_group_context&). If this call is made directly from the user thread, the effect will be as if relation_to_parent were isolated. If relation_to_parent is isolated, it has no parent task_group_context.

The traits argument should be the bitwise OR of traits_type values. The flag exact_exception controls how precisely exceptions are transferred between threads. See Section 13 for details. The flag concurrent_wait controls the reference-counting behavior of methods task::wait_for_all and task::spawn_and_wait_for_all.

12.6.2 ~task_group_context()

Effects
Destroys an empty task_group_context. It is a programmer error if there are still extant tasks in the group.

12.6.3 bool cancel_group_execution()

Effects
Requests that tasks in group be cancelled.

Returns
False if group is already cancelled; true otherwise. If concurrently called by multiple threads, exactly one call returns true and the rest return false.
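The exactly-one-true guarantee of cancel_group_execution() is the behavior an atomic exchange provides. The following is a minimal sketch in standard C++11, assuming a hypothetical CancelFlag type that models only the cancellation bit of a group; it is an illustration of the contract, not a description of how the library implements it. The helper count_successful_cancels is likewise hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical model of a group's cancellation bit (not the TBB class).
class CancelFlag {
    std::atomic<bool> cancelled_{false};
public:
    // Returns true only for the call that flips the flag from false to true.
    bool cancel_group_execution() { return !cancelled_.exchange(true); }
    bool is_group_execution_cancelled() const { return cancelled_.load(); }
    // Back to the uncancelled state; only meaningful once no thread
    // is still using the flag.
    void reset() { cancelled_.store(false); }
};

// Has `threads` threads request cancellation concurrently and
// returns how many of the calls returned true.
int count_successful_cancels(CancelFlag& flag, int threads) {
    std::atomic<int> winners{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i)
        pool.emplace_back([&] {
            if (flag.cancel_group_execution()) ++winners;
        });
    for (auto& t : pool) t.join();
    return winners.load();
}
```

A caller can use the return value to make sure that follow-up work, such as logging the reason for cancellation, is done by exactly one thread.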
12.6.4 bool is_group_execution_cancelled() const

Returns
True if group has received a cancellation request; false otherwise.

12.6.5 void reset()

Effects
Reinitializes this to uncancelled state.

CAUTION: This method is only safe to call once all tasks associated with the group and its subordinate groups have completed. This method must not be invoked concurrently by multiple threads.

12.6.6 void set_priority ( priority_t )

Effects
Changes priority of the task group.

12.6.7 priority_t priority () const

Returns
Priority of the task group.

12.7 task_scheduler_observer

Summary
Class that represents a thread's interest in task scheduling services.

Syntax
    class task_scheduler_observer;

Header
    #include "tbb/task_scheduler_observer.h"

Description
A task_scheduler_observer permits clients to observe when a thread starts or stops participating in task scheduling. You typically derive your own observer class from task_scheduler_observer, and override virtual methods on_scheduler_entry or on_scheduler_exit. An instance has a state: observing or not observing. Remember to call observe() to enable observation.

Members
    namespace tbb {
        class task_scheduler_observer {
        public:
            task_scheduler_observer();
            virtual ~task_scheduler_observer();
            void observe( bool state=true );
            bool is_observing() const;
            virtual void on_scheduler_entry( bool is_worker ) {}
            virtual void on_scheduler_exit( bool is_worker ) {}
        };
    }

12.7.1 task_scheduler_observer()

Effects
Constructs instance with observing disabled.

12.7.2 ~task_scheduler_observer()

Effects
Disables observing. Waits for extant invocations of on_scheduler_entry or on_scheduler_exit to complete.

12.7.3 void observe( bool state=true )

Effects
Enables observing if state is true; disables observing if state is false.

12.7.4 bool is_observing() const

Returns
True if observing is enabled; false otherwise.
12.7.5 virtual void on_scheduler_entry( bool is_worker )

Description
The task scheduler invokes this method on each thread that starts participating in task scheduling, if observing is enabled. If observing is enabled after threads started participating, then this method is invoked once for each such thread, before it executes the first task it steals afterwards.

The flag is_worker is true if the thread was created by the task scheduler; false otherwise.

NOTE: If a thread enables observing before spawning a task, it is guaranteed that the thread that executes the task will have invoked on_scheduler_entry before executing the task.

Effects
The default behavior does nothing.

12.7.6 virtual void on_scheduler_exit( bool is_worker )

Description
The task scheduler invokes this method when a thread stops participating in task scheduling, if observing is enabled.

CAUTION: Sometimes on_scheduler_exit is invoked for a thread but not on_scheduler_entry. This situation can arise if a thread never steals a task.

CAUTION: A process does not wait for Intel® TBB worker threads to clean up. Thus a process can terminate before on_scheduler_exit is invoked.

Effects
The default behavior does nothing.

12.8 Catalog of Recommended task Patterns

This section catalogues recommended task patterns. In each pattern, class T is assumed to derive from class task. Subtasks are labeled t1, t2, ... tk. The subscripts indicate the order in which the subtasks execute if no parallelism is available. If parallelism is available, the subtask execution order is non-deterministic, except that t1 is guaranteed to be executed by the spawning thread.

Recursive task patterns are recommended for efficient scalable parallelism, because they allow the task scheduler to unfold potential parallelism to match available parallelism. A recursive task pattern begins by creating a root task t0 and running it as follows.
    T& t0 = *new( task::allocate_root() ) T(...);
    task::spawn_root_and_wait(t0);

The root task’s method execute() recursively creates more tasks as described in subsequent subsections.

12.8.1 Blocking Style With k Children

The following shows the recommended style for a recursive task of type T where each level spawns k children.

    task* T::execute() {
        if( not recursing any further ) {
            ...
        } else {
            set_ref_count(k+1);
            task& tk = *new( allocate_child() ) T(...); spawn(tk);
            task& tk-1 = *new( allocate_child() ) T(...); spawn(tk-1);
            ...
            task& t1 = *new( allocate_child() ) T(...);
            spawn_and_wait_for_all(t1);
        }
        return NULL;
    }

Child construction and spawning may be reordered if convenient, as long as a task is constructed before it is spawned.

The key points of the pattern are:
• The call to set_ref_count uses k+1 as its argument. The extra 1 is critical.
• Each task is allocated by allocate_child.
• The call spawn_and_wait_for_all combines spawning and waiting. A more uniform but slightly less efficient alternative is to spawn all tasks with spawn and wait by calling wait_for_all.

12.8.2 Continuation-Passing Style With k Children

There are two recommended styles. They differ in whether it is more convenient to recycle the parent as the continuation or as a child. The decision should be based upon whether the continuation or child acts more like the parent.

Optionally, as shown in the following examples, the code can return a pointer to one of the children instead of spawning it. Doing so causes the child to execute immediately after the parent returns. This option often improves efficiency because it skips the pointless overhead of putting the task into the task pool and taking it back out.

12.8.2.1 Recycling Parent as Continuation

This style is useful when the continuation needs to inherit much of the state of the parent and the child does not need the state. The continuation must have the same type as the parent.
    task* T::execute() {
        if( not recursing any further ) {
            ...
            return NULL;
        } else {
            set_ref_count(k);
            recycle_as_continuation();
            task& tk = *new( allocate_child() ) T(...); spawn(tk);
            task& tk-1 = *new( allocate_child() ) T(...); spawn(tk-1);
            ...
            // Return pointer to first child instead of spawning it,
            // to remove unnecessary overhead.
            task& t1 = *new( allocate_child() ) T(...);
            return &t1;
        }
    }

The key points of the pattern are:
• The call to set_ref_count uses k as its argument. There is no extra +1 as there is in the blocking style discussed in Section 12.8.1.
• Each child task is allocated by allocate_child.
• The continuation is recycled from the parent, and hence gets the parent's state without doing copy operations.

12.8.2.2 Recycling Parent as a Child

This style is useful when the child inherits much of its state from a parent and the continuation does not need the state of the parent. The child must have the same type as the parent. In the example, C is the type of the continuation, and must derive from class task. If C does nothing except wait for all children to complete, then C can be the class empty_task (12.4).

    task* T::execute() {
        if( not recursing any further ) {
            ...
            return NULL;
        } else {
            // Construct continuation
            C& c = *new( allocate_continuation() ) C(...);
            c.set_ref_count(k);
            // Recycle self as first child
            task& tk = *new( c.allocate_child() ) T(...); spawn(tk);
            task& tk-1 = *new( c.allocate_child() ) T(...); spawn(tk-1);
            ...
            task& t2 = *new( c.allocate_child() ) T(...); spawn(t2);
            // task t1 is our recycled self.
            recycle_as_child_of(c);
            ...update fields of *this to the subproblem to be solved by t1...
            return this;
        }
    }

The key points of the pattern are:
• The call to set_ref_count uses k as its argument. There is no extra +1 as there is in the blocking style discussed in Section 12.8.1.
• Each child task except for t1 is allocated by c.allocate_child.
It is critical to use c.allocate_child, and not (*this).allocate_child; otherwise the task graph will be wrong.
• Task t1 is recycled from the parent, and hence gets the parent's state without performing copy operations. Do not forget to update the state to represent a child subproblem; otherwise infinite recursion will occur.

12.8.3 Letting Main Thread Work While Child Tasks Run

Sometimes it is desirable to have the main thread continue execution while child tasks are running. The following pattern does this by using a dummy empty_task (12.4).

    task* dummy = new( task::allocate_root() ) empty_task;
    dummy->set_ref_count(k+1);
    task& tk = *new( dummy->allocate_child() ) T; dummy->spawn(tk);
    task& tk-1 = *new( dummy->allocate_child() ) T; dummy->spawn(tk-1);
    ...
    task& t1 = *new( dummy->allocate_child() ) T; dummy->spawn(t1);
    ...do any other work...
    dummy->wait_for_all();
    dummy->destroy(*dummy);

The key points of the pattern are:
• The dummy task is a placeholder and never runs.
• The call to set_ref_count uses k+1 as its argument.
• The dummy task must be explicitly destroyed.

13 Exceptions

Intel® Threading Building Blocks (Intel® TBB) propagates exceptions along logical paths in a tree of tasks. Because these paths cross between thread stacks, support for moving an exception between stacks is necessary.

When an exception is thrown out of a task, it is caught inside the Intel® TBB run-time and handled as follows:
1. If the cancellation group for the task has already been cancelled, the exception is ignored.
2. Otherwise the exception or an approximation of it is captured.
3. The captured exception is rethrown from the root of the cancellation group after all tasks in the group have completed or have been successfully cancelled.
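The capture and rethrow steps above can be sketched with the standard std::exception_ptr facility. This is a minimal model in standard C++11, not Intel® TBB code: run_and_rethrow and observed_message are hypothetical helpers, a plain std::thread stands in for a task, and the cancellation check of step 1 is left out. The exception is captured on the thread that threw it and rethrown on the thread that waits.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: runs `body` on another thread, captures any exception
// thrown there, and rethrows it on the calling thread after the join.
void run_and_rethrow(const std::function<void()>& body) {
    std::exception_ptr captured;                  // null means nothing was thrown
    std::thread worker([&] {
        try {
            body();
        } catch (...) {
            captured = std::current_exception();  // step 2: capture
        }
    });
    worker.join();                                // step 3: wait for completion,
    if (captured) std::rethrow_exception(captured);  // then rethrow to the waiter
}

// Returns the what() text observed by the caller, or "" if nothing was thrown.
std::string observed_message(const std::function<void()>& body) {
    try {
        run_and_rethrow(body);
    } catch (const std::exception& e) {
        return e.what();
    }
    return "";
}
```

Because std::exception_ptr keeps the original exception object alive, the waiting thread sees the exact dynamic type, which corresponds to the exact-capture case rather than to an approximation.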
The exact exception is captured when both of the following conditions are true:
• The task’s task_group_context was created in a translation unit compiled with TBB_USE_CAPTURED_EXCEPTION=0.
• The Intel® TBB library was built with a compiler that supports the std::exception_ptr feature of C++ 200x.

Otherwise an approximation of the original exception x is captured as follows:
1. If x is a tbb_exception, it is captured by x.move().
2. If x is a std::exception, it is captured as a tbb::captured_exception(typeid(x).name(),x.what()).
3. Otherwise x is captured as a tbb::captured_exception with implementation-specified value for name() and what().

13.1 tbb_exception

Summary
Exception that can be moved to another thread.

Syntax
    class tbb_exception;

Header
    #include "tbb/tbb_exception.h"

Description
In a parallel environment, exceptions sometimes have to be propagated across threads. Class tbb_exception subclasses std::exception to add support for such propagation.

Members
    namespace tbb {
        class tbb_exception: public std::exception {
        public:
            virtual tbb_exception* move() = 0;
            virtual void destroy() throw() = 0;
            virtual void throw_self() = 0;
            virtual const char* name() const throw() = 0;
            virtual const char* what() const throw() = 0;
        };
    }

Derived classes should define the abstract virtual methods as follows:
• move() should create a pointer to a copy of the exception that can outlive the original. It may move the contents of the original.
• destroy() should destroy a copy created by move().
• throw_self() should throw *this.
• name() typically returns the RTTI name of the originally intercepted exception.
• what() returns a null-terminated string describing the exception.

13.2 captured_exception

Summary
Class used by Intel® TBB to capture an approximation of an exception.
Syntax
    class captured_exception;

Header
    #include "tbb/tbb_exception.h"

Description
When a task throws an exception, sometimes Intel® TBB converts the exception to a captured_exception before propagating it. The conditions for conversion are described in Section 13.

Members
    namespace tbb {
        class captured_exception: public tbb_exception {
        public:
            captured_exception( const captured_exception& src );
            captured_exception( const char* name, const char* info );
            ~captured_exception() throw();
            captured_exception& operator=( const captured_exception& );
            captured_exception* move() throw();
            void destroy() throw();
            void throw_self();
            const char* name() const throw();
            const char* what() const throw();
        };
    }

Only the additions that captured_exception makes to tbb_exception are described here. Section 13.1 describes the rest of the interface.

13.2.1 captured_exception( const char* name, const char* info )

Effects
Constructs a captured_exception with the specified name and info.

13.3 movable_exception

Summary
Subclass of tbb_exception interface that supports propagating copy-constructible data.

Syntax
    template<typename ExceptionData> class movable_exception;

Header
    #include "tbb/tbb_exception.h"

Description
This template provides a convenient way to implement a subclass of tbb_exception that propagates arbitrary copy-constructible data.

Members
    namespace tbb {
        template<typename ExceptionData>
        class movable_exception: public tbb_exception {
        public:
            movable_exception( const ExceptionData& src );
            movable_exception( const movable_exception& src ) throw();
            ~movable_exception() throw();
            movable_exception& operator=( const movable_exception& src );
            ExceptionData& data() throw();
            const ExceptionData& data() const throw();
            movable_exception* move() throw();
            void destroy() throw();
            void throw_self();
            const char* name() const throw();
            const char* what() const throw();
        };
    }

Only the additions that movable_exception makes to tbb_exception are described here.
Section 13.1 describes the rest of the interface.

13.3.1 movable_exception( const ExceptionData& src )

Effects
Constructs a movable_exception containing a copy of src.

13.3.2 ExceptionData& data() throw()

Returns
Reference to contained data.

13.3.3 const ExceptionData& data() const throw()

Returns
Const reference to contained data.

13.4 Specific Exceptions

Summary
Exceptions thrown by other library components.

Syntax
    class bad_last_alloc;
    class improper_lock;
    class invalid_multiple_scheduling;
    class missing_wait;

Header
    #include "tbb/tbb_exception.h"

Description
Table 42 describes when the exceptions are thrown.

Table 42: Classes for Specific Exceptions

    Exception                     Thrown when...
    bad_last_alloc                • A pop operation on a concurrent_queue or concurrent_bounded_queue
                                    corresponds to a push that threw an exception.
                                  • An operation on a concurrent_vector cannot be performed because a
                                    prior operation threw an exception.
    improper_lock                 A thread attempts to lock a critical_section or reader_writer_lock
                                  that it has already locked.
    invalid_multiple_scheduling   A task_group or structured_task_group attempts to run a task_handle
                                  twice.
    missing_wait                  A task_group or structured_task_group is destroyed before method
                                  wait() is invoked.

Members
    namespace tbb {
        class bad_last_alloc: public std::bad_alloc {
        public:
            const char* what() const throw();
        };
        class improper_lock: public std::exception {
        public:
            const char* what() const throw();
        };
        class invalid_multiple_scheduling: public std::exception {
        public:
            const char* what() const throw();
        };
        class missing_wait: public std::exception {
        public:
            const char* what() const throw();
        };
    }

14 Threads

Intel® Threading Building Blocks (Intel® TBB) provides a wrapper around the platform’s native threads, based upon the N3000 working draft for C++ 200x. Using this wrapper has two benefits:
• It makes threaded code portable across platforms.
• It eases later migration to ISO C++ 200x threads.

The library defines the wrapper in namespace std, not namespace tbb, as explained in Section 2.4.7. [28]

The significant departures from N3000 are shown in Table 43.

Table 43: Differences Between N3000 and Intel® TBB Thread Class

    N3000:       template<class Rep, class Period>
                 std::this_thread::sleep_for( const chrono::duration<Rep, Period>& rel_time )
    Intel® TBB:  std::this_thread::sleep_for( tick_count::interval_t )

    N3000:       rvalue reference parameters
    Intel® TBB:  Parameter changed to plain value, or function removed, as appropriate.

    N3000:       Constructor for std::thread takes an arbitrary number of arguments.
    Intel® TBB:  Constructor for std::thread takes 0-3 arguments.

The other changes are for compatibility with the current C++ standard or Intel® TBB. For example, constructors that have an arbitrary number of arguments require the variadic template features of C++ 200x.

CAUTION: Threads are heavyweight entities on most systems, and running too many threads on a system can seriously degrade performance. Consider using a task-based solution instead if practical.

[28] In Intel® TBB 2.2, the class was tbb::tbb_thread. Appendix A.7 explains the changes.

14.1 thread Class

Summary
Represents a thread of execution.

Syntax
    class thread;

Header
    #include "tbb/compat/thread"

Description
Class thread provides a platform-independent interface to native threads. An instance represents a thread. A platform-specific thread handle can be obtained via method native_handle().
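As a rough illustration of the interface that this wrapper mirrors, the sketch below uses the ISO C++11 std::thread from <thread> rather than the Intel® TBB header. It is an assumption that the calls shown (a constructor taking a function and two arguments, joinable(), join(), and get_id()) behave the same way under "tbb/compat/thread"; sum_to, sum_in_thread, and default_thread_is_not_joinable are hypothetical names.

```cpp
#include <cassert>
#include <thread>

// Hypothetical work item: adds 1..n into *out.
void sum_to(int n, int* out) {
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    *out = total;
}

// Evaluates sum_to(n, &result) on a new thread and returns the result.
int sum_in_thread(int n) {
    int result = 0;
    std::thread worker(sum_to, n, &result);  // the thread(F f, X x, Y y) form
    assert(worker.joinable());               // represents a thread of execution
    worker.join();                           // wait for the thread to complete
    assert(!worker.joinable());              // afterwards joinable()==false
    assert(worker.get_id() == std::thread::id());
    return result;
}

// A default-constructed thread does not represent a thread of execution.
bool default_thread_is_not_joinable() {
    std::thread none;
    return !none.joinable() && none.get_id() == std::thread::id();
}
```

One difference to keep in mind when migrating: in ISO C++11, destroying a std::thread that is still joinable calls std::terminate, so joining or detaching explicitly, as above, is the portable choice.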
Members
    namespace std {
        class thread {
        public:
    #if _WIN32||_WIN64
            typedef HANDLE native_handle_type;
    #else
            typedef pthread_t native_handle_type;
    #endif // _WIN32||_WIN64

            class id;

            thread();
            template<typename F> explicit thread(F f);
            template<typename F, typename X> thread(F f, X x);
            template<typename F, typename X, typename Y> thread(F f, X x, Y y);
            thread& operator=( thread& x );
            ~thread();

            bool joinable() const;
            void join();
            void detach();
            id get_id() const;
            native_handle_type native_handle();

            static unsigned hardware_concurrency();
        };
    }

14.1.1 thread()

Effects
Constructs a thread that does not represent a thread of execution, with get_id()==id().

14.1.2 template<typename F> thread(F f)

Effects
Constructs a thread that evaluates f().

14.1.3 template<typename F, typename X> thread(F f, X x)

Effects
Constructs a thread that evaluates f(x).

14.1.4 template<typename F, typename X, typename Y> thread(F f, X x, Y y)

Effects
Constructs a thread that evaluates f(x,y).

14.1.5 thread& operator=(thread& x)

Effects
If joinable(), calls detach(). Then assigns the state of x to *this and sets x to the default-constructed state.

CAUTION: Assignment moves the state instead of copying it.

14.1.6 ~thread()

Effects
if( joinable() ) detach().

14.1.7 bool joinable() const

Returns
get_id()!=id()

14.1.8 void join()

Requirements
joinable()==true

Effects
Waits for the thread to complete. Afterwards, joinable()==false.

14.1.9 void detach()

Requirements
joinable()==true

Effects
Sets *this to the default-constructed state and returns without blocking. The thread represented by *this continues execution.

14.1.10 id get_id() const

Returns
id of the thread, or a default-constructed id if *this does not represent a thread.

14.1.11 native_handle_type native_handle()

Returns
Native thread handle. The handle is a HANDLE on Windows* operating systems and a pthread_t on Linux* and Mac OS* X operating systems. For these systems, native_handle() returns 0 if joinable()==false.

14.1.12 static unsigned hardware_concurrency()

Returns
The number of hardware threads.
For example, 4 on a system with a single Intel® Core™2 Quad processor.

14.2 thread::id

Summary
Unique identifier for a thread.

Syntax
    class thread::id;

Header
    #include "tbb/compat/thread"

Description
A thread::id is an identifier value for a thread that remains unique over the thread's lifetime. A special value thread::id() represents no thread of execution. The instances are totally ordered.

Members
    namespace tbb {
        class thread::id {
        public:
            id();
        };

        template<typename charT, typename traits>
        std::basic_ostream<charT, traits>&
        operator<<( std::basic_ostream<charT, traits>& out, thread::id id );

        bool operator==(thread::id x, thread::id y);
        bool operator!=(thread::id x, thread::id y);
        bool operator<(thread::id x, thread::id y);
        bool operator<=(thread::id x, thread::id y);
        bool operator>(thread::id x, thread::id y);
        bool operator>=(thread::id x, thread::id y);
    } // namespace tbb

14.3 this_thread Namespace

Description
Namespace this_thread contains global functions related to threading.

Members
    namespace tbb {
        namespace this_thread {
            thread::id get_id();
            void yield();
            void sleep_for( const tick_count::interval_t& i );
        }
    }

14.3.1 thread::id get_id()

Returns
Id of the current thread.

14.3.2 void yield()

Effects
Offers to suspend the current thread so that another thread may run.

14.3.3 void sleep_for( const tick_count::interval_t & i)

Effects
The current thread blocks for at least time interval i.

Example
    using namespace tbb;

    void Foo() {
        // Sleep 30 seconds
        this_thread::sleep_for( tick_count::interval_t(30) );
    }

15 References

Umut A. Acar, Guy E. Blelloch, Robert D. Blumofe. The Data Locality of Work Stealing. ACM Symposium on Parallel Algorithms and Architectures (2000):1-12.

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (July 1995):207–216.
Working Draft, Standard for Programming Language C++. WG21 document N3000.

Steve MacDonald, Duane Szafron, and Jonathan Schaeffer. Rethinking the Pipeline as Object-Oriented States with Transformations. 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments (April 2004):12-21.

W.F. Burton and R.M. Sleep. Executing functional programs on a virtual tree of processors. Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture (October 1981):187-194.

ISO/IEC 14882, Programming Languages – C++.

Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato, Lawrence Rauchwerger. STAPL: An Adaptive, Generic Parallel C++ Library. Workshop on Language and Compilers for Parallel Computing (LCPC 2001), Cumberland Falls, Kentucky, Aug 2001. Lecture Notes in Computer Science 2624 (2003):193-208.

S. G. Akl and N. Santoro. Optimal Parallel Merging and Sorting Without Memory Conflicts. IEEE Transactions on Computers, Vol. C-36 No. 11, Nov. 1987.

Appendix A Compatibility Features

This appendix describes features of Intel® Threading Building Blocks (Intel® TBB) that remain for compatibility with previous versions. These features are deprecated and may disappear in future versions of Intel® TBB. Some of these features are available only if the preprocessor symbol TBB_DEPRECATED is non-zero.

A.1 parallel_while Template Class

Summary
Template class that processes work items.

TIP: This class is deprecated. Use parallel_do (4.7) instead.

Syntax
    template<typename Body> class parallel_while;

Header
    #include "tbb/parallel_while.h"

Description
A parallel_while performs parallel iteration over items. The processing to be performed on each item is defined by a function object of type Body. The items are specified in two ways:
• A stream of items.
• Additional items that are added while the stream is being processed.
Table 44 shows the requirements on the stream and body.

Table 44: parallel_while Requirements for Stream S and Body B

Pseudo-Signature: bool S::pop_if_present( B::argument_type& item )
Semantics: Get next stream item. parallel_while does not concurrently invoke the method on the same this.

Pseudo-Signature: B::operator()( B::argument_type& item ) const
Semantics: Process item. parallel_while may concurrently invoke the operator for the same this but different item.

Pseudo-Signature: B::argument_type()
Semantics: Default constructor.

Pseudo-Signature: B::argument_type( const B::argument_type& )
Semantics: Copy constructor.

Pseudo-Signature: ~B::argument_type()
Semantics: Destructor.

For example, a unary function object, as defined in Section 20.3 of the C++ standard, models the requirements for B. A concurrent_queue (5.5) models the requirements for S.

TIP: To achieve speedup, the grainsize of B::operator() needs to be on the order of at least ~10,000 instructions. Otherwise, the internal overheads of parallel_while swamp the useful work. The parallelism in parallel_while is not scalable if all the items come from the input stream. To achieve scaling, design your algorithm such that method add often adds more than one piece of work.

Members
    namespace tbb {
        template<typename Body>
        class parallel_while {
        public:
            parallel_while();
            ~parallel_while();

            typedef typename Body::argument_type value_type;

            template<typename Stream>
            void run( Stream& stream, const Body& body );

            void add( const value_type& item );
        };
    }

A.1.1 parallel_while()

Effects
Constructs a parallel_while that is not yet running.

A.1.2 ~parallel_while()

Effects
Destroys a parallel_while.

A.1.3 template<typename Stream> void run( Stream& stream, const Body& body )

Effects
Applies body to each item in stream and any other items that are added by method add. Terminates when both of the following conditions become true:
• stream.pop_if_present returned false.
• body(x) returned for all items x generated from the stream or method add.
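The termination rule above can be made concrete with a purely serial model. The sketch below is not the TBB implementation and performs no parallel work; serial_while, VecStream, Halver, and demo_serial_while are hypothetical names invented for illustration. It only shows how stream items and items supplied through add are drained until both conditions hold.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical serial stand-in for parallel_while; it is not the TBB class.
template<typename Body>
class serial_while {
public:
    typedef typename Body::argument_type value_type;

    serial_while() : running_(false) {}

    // Applies body to each stream item and to each item added via add().
    // Returns once the stream is exhausted and no added items remain.
    template<typename Stream>
    void run(Stream& stream, const Body& body) {
        running_ = true;
        value_type item;
        for (;;) {
            if (!added_.empty()) {
                item = added_.front();
                added_.pop_front();
            } else if (!stream.pop_if_present(item)) {
                break;  // stream empty and nothing left from add()
            }
            body(item);
        }
        running_ = false;
    }

    // Models the requirement that add() is called only from within body.
    void add(const value_type& item) {
        assert(running_);
        added_.push_back(item);
    }

private:
    std::deque<value_type> added_;
    bool running_;
};

// Stream S: hands out the stored items one at a time.
struct VecStream {
    std::vector<int> items;
    std::size_t next;
    bool pop_if_present(int& out) {
        if (next == items.size()) return false;
        out = items[next++];
        return true;
    }
};

// Body B: records each item; for an even item greater than 1, adds half of it.
struct Halver {
    typedef int argument_type;
    serial_while<Halver>* loop;
    std::vector<int>* seen;
    void operator()(int& x) const {
        seen->push_back(x);
        if (x % 2 == 0 && x > 1) loop->add(x / 2);
    }
};

// Runs the model on the stream {8, 3} and returns the processing order.
std::vector<int> demo_serial_while() {
    VecStream stream;
    stream.items.push_back(8);
    stream.items.push_back(3);
    stream.next = 0;
    serial_while<Halver> w;
    std::vector<int> seen;
    Halver body = { &w, &seen };
    w.run(stream, body);
    return seen;
}
```

In this model the item 8 generates 4, 2, and 1 through add before the stream is consulted again; a real parallel_while gives no such ordering guarantee, which is why the model should be read only as a statement of when run returns.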
A.1.4 void add( const value_type& item )

Requirements
Must be called from a call to body.operator() created by parallel_while. Otherwise, the termination semantics of method run are undefined.

Effects
Adds item to the collection of items to be processed.

A.2 Interface for constructing a pipeline filter

The interface for constructing a filter evolved over several releases of Intel® TBB. The two following subsections describe obsolete aspects of the interface.

A.2.1 filter::filter( bool is_serial )

Effects
Constructs a serial in-order filter if is_serial is true, or a parallel filter if is_serial is false. This deprecated constructor is superseded by the constructor filter( filter::mode ) described in Section 4.9.6.1.

A.2.2 filter::serial

The filter mode value filter::serial is now named filter::serial_in_order. The new name distinguishes it more clearly from the mode filter::serial_out_of_order.

A.3 Debugging Macros

The names of the debugging macros have changed as shown in Table 45. If you define the old macros, Intel® TBB sets each undefined new macro in a way that duplicates the behavior of the old macro settings.

The old TBB_DO_ASSERT enabled assertions, full support for Intel® Threading Tools, and performance warnings. These three distinct capabilities are now controlled by three separate macros as described in Section 3.2.

TIP: To enable all three capabilities with a single macro, define TBB_USE_DEBUG to be 1. If you had code under "#if TBB_DO_ASSERT" that should be conditionally included only when assertions are enabled, use "#if TBB_USE_ASSERT" instead.

Table 45: Deprecated Macros

Deprecated Macro         New Macro
TBB_DO_ASSERT            TBB_USE_DEBUG or TBB_USE_ASSERT, depending on context.
TBB_DO_THREADING_TOOLS   TBB_USE_THREADING_TOOLS

A.4 tbb::deprecated::concurrent_queue Template Class

Summary
Template class for queue with concurrent operations. This is the concurrent_queue supported in Intel® TBB 2.1 and prior.
New code should use the Intel® TBB 2.2 unbounded concurrent_queue or concurrent_bounded_queue.

Syntax
    template<typename T, typename Alloc=cache_aligned_allocator<T> >
    class concurrent_queue;

Header
    #include "tbb/concurrent_queue.h"

Description
A tbb::deprecated::concurrent_queue is a bounded first-in first-out data structure that permits multiple threads to concurrently push and pop items. The default bounds are large enough to make the queue practically unbounded, subject to memory limitations on the target machine.

NOTE: Compile with TBB_DEPRECATED=1 to inject tbb::deprecated::concurrent_queue into namespace tbb.

Consider eventually migrating to the new queue classes.
• Use the new tbb::concurrent_queue if you need only the non-blocking operations (push and try_pop) for modifying the queue.
• Otherwise use the new tbb::concurrent_bounded_queue. It supports both blocking operations (push and pop) and non-blocking operations.

In both cases, use the new method names in Table 46.

Table 46: Method Name Changes for Concurrent Queues

Method in tbb::deprecated::concurrent_queue   Equivalent method in tbb::concurrent_queue or tbb::concurrent_bounded_queue
pop_if_present                                try_pop
push_if_not_full                              try_push (not available in tbb::concurrent_queue)
begin                                         unsafe_begin
end                                           unsafe_end

Members
    namespace tbb {
        namespace deprecated {
            template<typename T, typename Alloc=cache_aligned_allocator<T> >
            class concurrent_queue {
            public:
                // types
                typedef T value_type;
                typedef T& reference;
                typedef const T& const_reference;
                typedef std::ptrdiff_t size_type;
                typedef std::ptrdiff_t difference_type;

                concurrent_queue(const Alloc& a = Alloc());
                concurrent_queue(const concurrent_queue& src, const Alloc& a = Alloc());
                template<typename InputIterator>
                concurrent_queue(InputIterator first, InputIterator last, const Alloc& a = Alloc());
                ~concurrent_queue();

                void push(const T& source);
                bool push_if_not_full(const T& source);
                void pop(T& destination);
                bool pop_if_present(T& destination);
                void clear();

                size_type size() const;
                bool empty() const;
                size_t capacity() const;
                void
                set_capacity(size_type capacity);
                Alloc get_allocator() const;

                typedef implementation-defined iterator;
                typedef implementation-defined const_iterator;

                // iterators (these are slow and intended only for debugging)
                iterator begin();
                iterator end();
                const_iterator begin() const;
                const_iterator end() const;
            };
        }
    #if TBB_DEPRECATED
        using deprecated::concurrent_queue;
    #else
        using strict_ppl::concurrent_queue;
    #endif
    }

A.5 Interface for concurrent_vector

The return type of methods grow_by and grow_to_at_least changed in Intel® TBB 2.2. Compile with the preprocessor symbol TBB_DEPRECATED set to nonzero to get the old methods.

Table 47: Change in Return Types

Method                       Deprecated Return Type   New Return Type
grow_by (5.8.3.1)            size_type                iterator
grow_to_at_least (5.8.3.2)   void                     iterator
push_back (5.8.3.3)          size_type                iterator

A.5.1 void compact()

Effects
Same as shrink_to_fit() (5.8.2.2).

A.6 Interface for class task

Some methods of class task are deprecated because they have obsolete or redundant functionality.

Deprecated Members of class task
    namespace tbb {
        class task {
        public:
            ...
            void recycle_to_reexecute();

            // task depth
            typedef implementation-defined-signed-integral-type depth_type;
            depth_type depth() const {return 0;}
            void set_depth( depth_type new_depth ) {}
            void add_to_depth( int delta ) {}
            ...
        };
    }

A.6.1 void recycle_to_reexecute()

Intel® TBB 3.0 deprecated method recycle_to_reexecute because it is redundant. Replace a call t->recycle_to_reexecute() with the following sequence:

    t->set_refcount(1);
    t->recycle_as_safe_continuation();

A.6.2 Depth interface for class task

Intel® TBB 2.2 eliminated the notion of task depth that was present in prior versions of Intel® TBB. The members of class task that related to depth have been retained under TBB_DEPRECATED, but do nothing.

A.7 tbb_thread Class

Intel® TBB 3.0 introduces a header tbb/compat/thread that defines class std::thread.
Prior versions had a header tbb/tbb_thread.h that defined class tbb_thread. The old header and names are still available, but deprecated in favor of the replacements shown in Table 48.

Table 48: Replacements for Deprecated Names

Entity        Deprecated                    Replacement
Header        tbb/tbb_thread.h              tbb/compat/thread
Identifiers   tbb::tbb_thread               std::thread
              tbb::this_tbb_thread          std::this_thread
              tbb::this_tbb_thread::sleep   std::this_thread::sleep_for

Most of the changes reflect a change in the way that the library implements C++ 200x features (2.4.7). The change from sleep to sleep_for reflects a change in the C++ 200x working draft.

Appendix B PPL Compatibility

Intel® Threading Building Blocks (Intel® TBB) 2.2 introduces features based on joint discussions between the Microsoft Corporation and Intel Corporation. The features establish some degree of compatibility between Intel® TBB and Microsoft Parallel Patterns Library (PPL) development software.

Table 49 lists the features. Each feature appears in namespace tbb. Each feature can be injected into namespace Concurrency by including the file "tbb/compat/ppl.h".

Table 49: PPL Compatibility Features

Section   Feature
4.4       parallel_for( first,last,f )
4.4       parallel_for( first,last,step,f )
4.8       parallel_for_each
4.12      parallel_invoke
9.3.1     critical_section
9.3.2     reader_writer_lock
11.3      task_handle
11.2      task_group_status
11.1.1    task_group
11.4      make_task
11.5      structured_task_group
11.6      is_current_task_group_cancelling
13.4      improper_lock
13.4      invalid_multiple_scheduling
13.4      missing_wait

For parallel_for, only the variants listed in the table are injected into namespace Concurrency.

CAUTION: Because of different environments and evolving specifications, the behavior of the features can differ between the Intel® TBB and PPL implementations.
Appendix C Known Issues

This section explains known issues with using Intel® Threading Building Blocks (Intel® TBB).

C.1 Windows* OS

Some Intel® TBB header files necessarily include the header file <windows.h>, which by default defines the macros min and max, and consequently breaks the ISO C++ header files <algorithm> and <limits>. Defining the preprocessor symbol NOMINMAX causes <windows.h> to not define the offending macros. Thus programs using Intel® TBB and either of the aforementioned ISO C++ headers should be compiled with /DNOMINMAX as a compiler argument.

Appendix D Community Preview Features

This section provides documentation for Community Preview (CP) features.

What is a Community Preview Feature?

A Community Preview feature is a component of Intel® Threading Building Blocks (Intel® TBB) that is being introduced to gain early feedback from developers. Comments, questions and suggestions related to Community Preview features are encouraged and should be submitted to the forums at www.threadingbuildingblocks.org.

The key properties of a CP feature are:
• It must be explicitly enabled. It is off by default.
• It is intended to have a high quality implementation.
• There is no guarantee of future existence or compatibility.
• It may have limited or no support in tools such as correctness analyzers, profilers and debuggers.

CAUTION: A CP feature is subject to change in the future. It may be removed or radically altered in future releases of the library. Changes to a CP feature do NOT require the usual deprecation and deletion process. Using a CP feature in a production code base is therefore strongly discouraged.

Enabling a Community Preview Feature

A Community Preview feature may be defined completely in header files or it may require some additional support defined in a library. For a CP feature that is contained completely in header files, a feature-specific macro must be defined before inclusion of the header files.
Example
    #define TBB_PREVIEW_FOO 1
    #include "tbb/foo.h"

If a CP feature requires support from a library, then an additional library must be linked with the application.

The use of separate headers, feature-specific macros and separate libraries mitigates the impact of Community Preview features on other product features.

NOTE: Unless a CP feature is explicitly enabled using the above mechanisms, it will have no impact on the application.

D.1 Flow Graph

This section describes Flow Graph nodes that are available as Community Preview features.

D.1.1 or_node Template Class

Summary
A node that broadcasts messages received at its input ports to all of its successors. Each input port pi is a receiver<Ti>. The messages are broadcast individually as they are received at each port. The output message type is a struct that contains an index number that identifies the port on which the message arrived and a tuple of the input types where the value is stored.

Syntax
    template<typename InputTuple> class or_node;

Header
    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

Description
An or_node is a graph_node and sender< or_node<InputTuple>::output_type >. It contains a tuple of input ports, each of which is a receiver<Ti> for each of the T0 .. TN in InputTuple. It supports multiple input receivers with distinct types and broadcasts each received message to all of its successors. Unlike a join_node, each message is broadcast individually to all successors of the or_node as it arrives at an input port. The incoming messages are wrapped in a struct that contains the index of the port on which the message arrived and a tuple of the input types where the received value is stored.

The function template input_port described in 6.19 simplifies the syntax for getting a reference to a specific input port.

Rejection of messages by successors of the or_node is handled using the protocol in Figure 4. The input ports never reject incoming messages.
InputTuple must be a std::tuple<T0,T1,...> where each element is copy-constructible and assignable.

Example
    #include <cstdio>
    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    int main() {
        graph g;
        function_node<int,int> f1( g, unlimited,
            [](const int &i) { return 2*i; } );
        function_node<float,float> f2( g, unlimited,
            [](const float &f) { return f/2; } );

        typedef or_node< std::tuple<int,float> > my_or_type;
        my_or_type o;

        function_node< my_or_type::output_type > f3( g, unlimited,
            []( const my_or_type::output_type &v ) {
                if (v.indx == 0) {
                    printf("Received an int %d\n", std::get<0>(v.result));
                } else {
                    printf("Received a float %f\n", std::get<1>(v.result));
                }
            }
        );

        make_edge( f1, input_port<0>(o) );
        make_edge( f2, input_port<1>(o) );
        make_edge( o, f3 );

        f1.try_put( 3 );
        f2.try_put( 3 );
        g.wait_for_all();
        return 0;
    }

In the example above, three function_node objects are created: f1 multiplies an int i by 2, f2 divides a float f by 2, and f3 prints the values from f1 and f2 as they arrive. The or_node o wraps the output of f1 and f2 and forwards each result to f3. This example is purely a syntactic demonstration since there is very little work in the nodes.

Members
    namespace tbb {
        namespace flow {
            template<typename InputTuple>
            class or_node : public graph_node,
                            public sender< impl-dependent-output-type > {
            public:
                typedef struct { size_t indx; InputTuple result; } output_type;
                typedef receiver<output_type> successor_type;
                typedef implementation-dependent-tuple input_ports_tuple_type;

                or_node();
                or_node(const or_node &src);
                input_ports_tuple_type &inputs();
                bool register_successor( successor_type &r );
                bool remove_successor( successor_type &r );
                bool try_get( output_type &v );
                bool try_reserve( output_type & );
                bool try_release( );
                bool try_consume( );
            };
        }
    }

D.1.1.1 or_node( )

Effect
Constructs an or_node.

D.1.1.2 or_node( const or_node &src )

Effect
Constructs an or_node. The list of predecessors, messages in the input ports, and successors are NOT copied.
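To make the shape of output_type described above easier to picture without building a graph, here is a minimal standard-library sketch of a message carrying a port index plus a tuple, and of a consumer that dispatches on the index. The names or_output and describe are hypothetical stand-ins and are not part of Intel® TBB.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>

// Hypothetical stand-in mirroring the documented output_type of an
// or_node whose InputTuple is std::tuple<int,float>.
struct or_output {
    std::size_t indx;               // port on which the message arrived
    std::tuple<int, float> result;  // only element number indx is meaningful
};

// Dispatches on the port index, as the body of f3 does in the example.
std::string describe(const or_output& v) {
    assert(v.indx < 2);
    if (v.indx == 0) {
        return "int " + std::to_string(std::get<0>(v.result));
    }
    return "float " + std::to_string(std::get<1>(v.result));
}
```

A receiver should read only the tuple element selected by indx; the other elements hold whatever value happened to be stored there.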
D.1.1.3 input_ports_tuple_type& inputs()

Returns
A std::tuple of receivers. Each element inherits from tbb::receiver<T> where T is the type of message expected at that input. Each tuple element can be used like any other flow::receiver<T>.

D.1.1.4 bool register_successor( successor_type & r )

Effect
Adds r to the set of successors.

Returns
true.

D.1.1.5 bool remove_successor( successor_type & r )

Effect
Removes r from the set of successors.

Returns
true.

D.1.1.6 bool try_get( output_type &v )

Description
An or_node contains no buffering and therefore does not support gets.

Returns
false.

D.1.1.7 bool try_reserve( output_type & )

Description
An or_node contains no buffering and therefore cannot be reserved.

Returns
false.

D.1.1.8 bool try_release( )

Description
An or_node contains no buffering and therefore cannot be reserved.

Returns
false.

D.1.1.9 bool try_consume( )

Description
An or_node contains no buffering and therefore cannot be reserved.

Returns
false.

D.1.2 multioutput_function_node Template Class

Summary
A template class that is a receiver<InputType> and has a tuple of sender<T> outputs. This node may have concurrency limits as set by the user. When the concurrency limit allows, it executes the user-provided body on incoming messages. The body may create one or more output messages and broadcast them to successors.

Syntax
    template < typename InputType, typename OutputTuple >
    class multioutput_function_node;

Header
    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

Description
This type is used for nodes that receive messages at a single input port and may generate one or more messages that are broadcast to successors.

A multioutput_function_node maintains an internal constant threshold T and an internal counter C. At construction, C=0 and T is set to the value passed in to the constructor. The behavior of a call to try_put is determined by the value of T and C as shown in Table 50.
Table 50: Behavior of a call to a multioutput_function_node's try_put

Value of threshold T: T == flow::unlimited
Value of counter C:   NA
bool try_put( input_type v ): A task is enqueued that executes body(v). Returns true.

Value of threshold T: T != flow::unlimited
Value of counter C:   C < T
bool try_put( input_type v ): Increments C. A task is enqueued that executes body(v) and then decrements C. Returns true.

Value of threshold T: T != flow::unlimited
Value of counter C:   C >= T
bool try_put( input_type v ): Returns false.

A multioutput_function_node has a user-settable concurrency limit. It can have flow::unlimited concurrency, which allows an unlimited number of copies of the node to execute concurrently. It can have flow::serial concurrency, which allows only a single copy of the node to execute at a time. The user can also provide a value of type size_t to limit concurrency to a value between 1 and unlimited.

The Body concept for multioutput_function_node is shown in Table 51.

Table 51: multioutput_function_node Body Concept

Pseudo-Signature: B::B( const B& )
Semantics: Copy constructor.

Pseudo-Signature: B::~B()
Semantics: Destructor.

Pseudo-Signature: void operator=( const B& )
Semantics: Assignment.

Pseudo-Signature: void B::operator()(const InputType &v, output_ports &p)
Semantics: Perform operation on v. May call try_put on zero or more output_ports. May call try_put on output_ports multiple times.

Example
The example below shows a multioutput_function_node that separates a stream of integers into odd and even, placing each in the appropriate output queue.

The Body method will receive as parameters a read-only reference to the input value and a reference to the tuple of output ports. The Body method may put items to one or more output ports.
The output ports of the multioutput_function_node can be connected to other graph nodes using the make_edge method or by using register_successor:

    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    typedef multioutput_function_node<int, std::tuple<int,int> > multi_node;

    struct MultiBody {
        void operator()(const int &i, multi_node::output_ports_type &op) {
            if(i % 2)
                std::get<1>(op).put(i); // put to odd queue
            else
                std::get<0>(op).put(i); // put to even queue
        }
    };

    int main() {
        graph g;
        queue_node<int> even_queue(g);
        queue_node<int> odd_queue(g);

        multi_node node1(g,unlimited,MultiBody());

        output_port<0>(node1).register_successor(even_queue);
        make_edge(output_port<1>(node1), odd_queue);

        for(int i = 0; i < 1000; ++i) {
            node1.try_put(i);
        }
        g.wait_for_all();
    }

Members
    namespace tbb {
        template< typename InputType, typename OutputTuple,
                  graph_buffer_policy=queueing, A>
        class multioutput_function_node :
            public graph_node, public receiver<InputType>
        {
        public:
            typedef (input_queue) queue_type;

            template<typename Body>
            multioutput_function_node( graph &g, size_t concurrency,
                                       Body body, queue_type *q = NULL );
            multioutput_function_node( const multioutput_function_node &other,
                                       queue_type *q = NULL );
            ~multioutput_function_node();

            typedef InputType input_type;
            typedef sender<input_type> predecessor_type;

            bool try_put( input_type v );
            bool register_predecessor( predecessor_type &p );
            bool remove_predecessor( predecessor_type &p );

            typedef OutputType tuple_port_types;
            typedef (tuple of sender) output_ports_type;
        };

        template &output_port(MFN &node);
    }

D.1.2.1 template< typename Body> multioutput_function_node(graph &g, size_t concurrency, Body body, queue_type *q = NULL)

Description
Constructs a multioutput_function_node that will invoke body. At most concurrency calls to the body may be made concurrently.
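The concurrency limit can be read as the threshold-and-counter rule of Table 50. The following single-threaded model is only a sketch of that rule, not the library's implementation: concurrency_gate is a hypothetical name, and using 0 to mean "unlimited" is an assumption made for the model. A real node would update the counter atomically and enqueue a task instead of returning to the caller.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-threaded model of the T/C rule in Table 50.
class concurrency_gate {
public:
    // threshold == 0 stands for "unlimited" in this model (an assumption).
    explicit concurrency_gate(std::size_t threshold)
        : threshold_(threshold), running_(0) {}

    // Models try_put: returns true if a body invocation may start.
    bool try_enter() {
        if (threshold_ == 0) return true;          // unlimited: always accept
        if (running_ < threshold_) {               // C < T: accept, increment C
            ++running_;
            return true;
        }
        return false;                              // C >= T: reject
    }

    // Models the decrement of C after body(v) finishes.
    void leave() {
        if (threshold_ == 0) return;
        assert(running_ > 0);
        --running_;
    }

    std::size_t running() const { return running_; }

private:
    std::size_t threshold_;  // T, fixed at construction
    std::size_t running_;    // C, starts at 0
};
```

With a threshold of 2, two calls to try_enter succeed and a third fails until leave is called once, which matches the three rows of the table.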
D.1.2.2 template< typename Body> multioutput_function_node(multioutput_function_node const & other, queue_type *q = NULL)

Effect
Constructs a copy of a multioutput_function_node with an optional input queue.

D.1.2.3 bool register_predecessor( predecessor_type & p )

Effect
Adds p to the set of predecessors.

Returns
true.

D.1.2.4 bool remove_predecessor( predecessor_type & p )

Effect
Removes p from the set of predecessors.

Returns
true.

D.1.2.5 bool try_put( input_type v )

Effect
If fewer copies of the node exist than the allowed concurrency, a task is spawned to execute body on v. The body may put results to one or more successors in the tuple of output ports.

Returns
true.

D.1.2.6 (output port &) output_port(node)

Returns
A reference to port N of the multioutput_function_node node.

D.1.3 split_node Template Class

Summary
A template class that is a receiver<InputType> and has a tuple of sender<T> outputs. A split_node is a multioutput_function_node with a body that sends each element of the incoming tuple to the output port that matches the element's index in the incoming tuple. This node has unlimited concurrency.

Syntax
    template < typename InputType >
    class split_node;

Header
    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

Description
This type is used for nodes that receive tuples at a single input port and generate a message from each element of the tuple, passing each to its corresponding output port.

A split_node has unlimited concurrency, no buffering, and behaves as a broadcast_node with multiple output ports.

Example
The example below shows a split_node that separates a stream of tuples of integers, placing each element of the tuple in the appropriate output queue.
The output ports of the split_node can be connected to other graph nodes using the make_edge method or by using register_successor:

    #define TBB_PREVIEW_GRAPH_NODES 1
    #include "tbb/flow_graph.h"

    using namespace tbb::flow;

    typedef split_node< std::tuple<int,int> > s_node;

    int main() {
        typedef std::tuple<int,int> int_tuple_type;
        graph g;
        queue_node<int> first_queue(g);
        queue_node<int> second_queue(g);

        s_node node1(g);

        output_port<0>(node1).register_successor(first_queue);
        make_edge(output_port<1>(node1), second_queue);

        for(int i = 0; i < 1000; ++i) {
            node1.try_put(int_tuple_type(2*i,2*i+1));
        }
        g.wait_for_all();
    }

Members
    namespace tbb {
        template< typename InputType, A >
        class split_node :
            public multioutput_function_node
        {
        public:
            split_node( graph &g );
            split_node( const split_node &other );
            ~split_node();

            typedef InputType input_type;
            typedef sender<input_type> predecessor_type;

            bool try_put( input_type v );
            bool register_predecessor( predecessor_type &p );
            bool remove_predecessor( predecessor_type &p );

            typedef OutputType tuple_port_types;
            typedef (tuple of sender) output_ports_type;
        };

        template &output_port(MFN &node);
    }

D.1.3.1 split_node(graph &g)

Description
Constructs a split_node.

D.1.3.2 split_node(split_node const & other)

Effect
Constructs a copy of a split_node.

D.1.3.3 bool register_predecessor( predecessor_type & p )

Effect
Adds p to the set of predecessors.

Returns
true.

D.1.3.4 bool remove_predecessor( predecessor_type & p )

Effect
Removes p from the set of predecessors.

Returns
true.

D.1.3.5 bool try_put( input_type v )

Effect
Forwards each element of the input tuple v to the corresponding output port.

Returns
true.

D.1.3.6 (output port &) output_port(node)

Returns
A reference to port N of the split_node.

D.2 Run-time loader

Summary
The run-time loader is a mechanism that provides additional run-time control over the version of the Intel® Threading Building Blocks (Intel® TBB) dynamic library used by an application, plug-in, or another library.
Header
    #define TBB_PREVIEW_RUNTIME_LOADER 1
    #include "tbb/runtime_loader.h"

Library

OS        Release build   Debug build
Windows   tbbproxy.lib    tbbproxy_debug.lib

Description
The run-time loader consists of a class and a static library that can be linked with an application, library, or plug-in to provide better run-time control over the version of Intel® TBB used. The class allows loading a desired version of the dynamic library at run time with an explicit list of directories for library search. The static library provides stubs for functions and methods to resolve link-time dependencies, which are then dynamically substituted with the proper functions and methods from a loaded Intel® TBB library.

All instances of class runtime_loader in the same module (i.e. exe or dll) share certain global state. The most noticeable piece of this state is the loaded Intel® TBB library. The implications of that are:
• Only one Intel® TBB library per module can be loaded.
• If one runtime_loader instance has already loaded a library, another one created by the same module will not load another one. If the loaded library is suitable for the second instance, both will use it cooperatively, otherwise an error will be reported (details below).
• If different versions of the library are requested by different modules, those can be loaded, but may result in processor oversubscription.
• runtime_loader objects are not thread-safe and may work incorrectly if used concurrently.

NOTE: If an application or a library uses runtime_loader, it should be linked with one of the above specified libraries instead of a normal Intel® TBB library.
Example

    #define TBB_PREVIEW_RUNTIME_LOADER 1
    #include "tbb/runtime_loader.h"
    #include "tbb/parallel_for.h"
    #include <cstddef>   // for NULL

    char const * path[] = { "c:\\myapp\\lib\\ia32", NULL };

    int main() {
        tbb::runtime_loader loader( path );
        if( loader.status()!=tbb::runtime_loader::ec_ok )
            return -1;
        // The loader does not impact how TBB is used
        tbb::parallel_for(0, 10, ParallelForBody());
        return 0;
    }

In this example, the Intel® Threading Building Blocks (Intel® TBB) library will be loaded from the c:\myapp\lib\ia32 directory. No explicit requirements for a version are specified, so the minimal suitable version is the version used to compile the example, and any higher version is suitable as well. If the library is successfully loaded, it can be used in the normal way.

D.2.1 runtime_loader Class

Summary
Class for run-time control over the loading of an Intel® Threading Building Blocks dynamic library.

Syntax

    class runtime_loader;

Members

    namespace tbb {
    class runtime_loader {
        // Error codes.
        enum error_code {
            ec_ok,       // No errors.
            ec_bad_call, // Invalid function call.
            ec_bad_arg,  // Invalid argument passed.
            ec_bad_lib,  // Invalid library found.
            ec_bad_ver,  // The library found is not suitable.
            ec_no_lib    // No library found.
        };
        // Error mode constants.
        enum error_mode {
            em_status, // Save status of operation and continue.
            em_throw,  // Throw an exception of error_code type.
            em_abort   // Print message to stderr, and abort().
        };
        runtime_loader( error_mode mode = em_abort );
        runtime_loader(
            char const *path[],                  // List of directories to search in.
            int min_ver = TBB_INTERFACE_VERSION, // Minimal suitable version
            int max_ver = INT_MAX,               // Maximal suitable version
            error_mode mode = em_abort           // Error mode for this instance.
        );
        ~runtime_loader();
        error_code load(
            char const * path[],
            int min_ver = TBB_INTERFACE_VERSION,
            int max_ver = INT_MAX
        );
        error_code status();
    };
    }

D.2.1.1 runtime_loader( error_mode mode = em_abort )

Effects
Initializes runtime_loader but does not load a library.

D.2.1.2 runtime_loader(char const * path[], int min_ver = TBB_INTERFACE_VERSION, int max_ver = INT_MAX, error_mode mode = em_abort )

Requirements
The last element of path[] must be NULL.

Effects
Initializes runtime_loader and loads Intel® TBB (see load() for details). If the error mode is em_status, the method status() can be used to check whether the library was loaded. If the error mode is em_throw, a failure causes an exception of type error_code to be thrown. If the error mode is em_abort, a failure causes a message to be printed to stderr and execution to be aborted.

D.2.1.3 error_code load(char const * path[], int min_ver = TBB_INTERFACE_VERSION, int max_ver = INT_MAX)

Requirements
The last element of path[] must be NULL.

Effects
Loads a suitable version of an Intel® TBB dynamic library from one of the specified directories.

TIP: The method searches for a library in the directories specified in the path[] array. When a library is found, it is loaded and its interface version (as returned by TBB_runtime_interface_version()) is checked. If the version does not meet the requirements specified by min_ver and max_ver, the library is unloaded. The search continues in the next specified path, until a suitable version of the Intel® TBB library is found or the array of paths ends with NULL. It is recommended to use default values for min_ver and max_ver.

CAUTION: For security reasons, avoid using relative directory names such as current ("."), parent ("..") or any other relative directory (like "lib") when searching for a library. Use only absolute directory names (as shown in the example above); if necessary, construct absolute names at run time.
Neglecting these rules may cause your program to execute third-party malicious code. (See http://www.microsoft.com/technet/security/advisory/2269637.mspx for details.)

Returns
    ec_ok - a suitable version was successfully loaded.
    ec_bad_call - this runtime_loader instance has already been used to load a library.
    ec_bad_lib - a library was found but it appears invalid.
    ec_bad_arg - min_ver and/or max_ver is negative or zero, or min_ver > max_ver.
    ec_bad_ver - an unsuitable version has already been loaded by another instance.
    ec_no_lib - no suitable version was found.

D.2.1.4 error_code status()

Returns
If the error mode is em_status, the function returns the status of the last operation.

D.3 parallel_deterministic_reduce Template Function

Summary
Computes reduction over a range, with deterministic split/join behavior.

Syntax

    template<typename Range, typename Value, typename Func, typename Reduction>
    Value parallel_deterministic_reduce( const Range& range,
                                         const Value& identity,
                                         const Func& func,
                                         const Reduction& reduction
                                         [, task_group_context& group] );

    template<typename Range, typename Body>
    void parallel_deterministic_reduce( const Range& range,
                                        const Body& body
                                        [, task_group_context& group] );

Header

    #define TBB_PREVIEW_DETERMINISTIC_REDUCE 1
    #include "tbb/parallel_reduce.h"

Description
The parallel_deterministic_reduce template is very similar to the parallel_reduce template. It also has the functional and imperative forms and has similar requirements for Func and Reduction (Table 12) and Body (Table 13).

Unlike parallel_reduce, parallel_deterministic_reduce has deterministic behavior with regard to splits of both Body and Range and joins of the bodies. For the functional form, this means Func is applied to a deterministic set of Ranges, and Reduction merges partial results in a deterministic order. To achieve that, parallel_deterministic_reduce always uses simple_partitioner, because other partitioners may react to random work-stealing behavior (see 4.3.1).
So the template declaration does not have a partitioner argument. parallel_deterministic_reduce always invokes the Body splitting constructor for each range splitting.

    b0 [0,20)
    b0 [0,10)                 b2 [10,20)
    b0 [0,5)    b1 [5,10)     b2 [10,15)    b3 [15,20)

Figure 18: Execution of parallel_deterministic_reduce over blocked_range<int>(0,20,5)

As a result, parallel_deterministic_reduce recursively splits a range until it is no longer divisible, and creates a new body (by calling the Body splitting constructor) for each new subrange. As with parallel_reduce, for each body split the method join is invoked in order to merge the results from the bodies. Figure 18 shows the execution of parallel_deterministic_reduce over a sample range, with the slash marks (/) denoting where new instances of the body were created.

Therefore, for given arguments, parallel_deterministic_reduce executes the same set of split and join operations no matter how many threads participate in execution and how tasks are mapped to the threads. If the user-provided functions are also deterministic (i.e. different runs with the same input result in the same output), then multiple calls to parallel_deterministic_reduce will produce the same result. Note, however, that the result might differ from that obtained with an equivalent sequential (linear) algorithm.

CAUTION: Since simple_partitioner is always used, be careful to specify an appropriate grainsize (see simple_partitioner class).

Complexity
If the range and body take O(1) space, and the range splits into nearly equal pieces, then the space complexity is O(P log(N)), where N is the size of the range and P is the number of threads.

Example
The example from the parallel_reduce section can be easily modified to use parallel_deterministic_reduce. It is sufficient to define the TBB_PREVIEW_DETERMINISTIC_REDUCE macro and rename parallel_reduce to parallel_deterministic_reduce; a partitioner, if any, should be removed to prevent a compilation error.
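
The fixed split tree described above can be modelled without Intel® TBB. The following is a minimal serial sketch, not the library's implementation: split_leaves and tree_sum are hypothetical helpers, and the sketch assumes a half-open range is halved at its midpoint for as long as its size exceeds the grain size. Under that assumption, the range [0,20) with grain size 5 yields the four leaf subranges of Figure 18, and the join order is fixed by the tree rather than by any scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [first, second).
typedef std::pair<std::size_t, std::size_t> IndexRange;

// Hypothetical helper (not a TBB API): collects the leaf subranges obtained by
// halving r at its midpoint while its size exceeds the grain size.
inline void split_leaves(IndexRange r, std::size_t grain,
                         std::vector<IndexRange>& leaves) {
    assert(grain > 0 && r.first <= r.second);
    if (r.second - r.first > grain) {
        std::size_t mid = r.first + (r.second - r.first) / 2;
        split_leaves(IndexRange(r.first, mid), grain, leaves);
        split_leaves(IndexRange(mid, r.second), grain, leaves);
    } else {
        leaves.push_back(r);
    }
}

// Hypothetical helper: sums data over r by reducing each half separately and
// then joining the two partial sums, so the association order follows the tree.
inline float tree_sum(const float* data, IndexRange r, std::size_t grain) {
    assert(grain > 0 && r.first <= r.second);
    if (r.second - r.first > grain) {
        std::size_t mid = r.first + (r.second - r.first) / 2;
        float left = tree_sum(data, IndexRange(r.first, mid), grain);
        float right = tree_sum(data, IndexRange(mid, r.second), grain);
        return left + right;  // the "join" step
    }
    float partial = 0.f;
    for (std::size_t i = r.first; i != r.second; ++i)
        partial += data[i];
    return partial;
}
```

Because floating-point addition is not associative, tree_sum may return a value that differs slightly from a plain left-to-right loop over the same data, which is the difference from a sequential algorithm that the Description warns about.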
A grain size may need to be specified for blocked_range if performance suffers.

    #define TBB_PREVIEW_DETERMINISTIC_REDUCE 1
    #include <numeric>
    #include <functional>
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    using namespace tbb;

    float ParallelSum( float array[], size_t n ) {
        size_t grain_size = 1000;
        return parallel_deterministic_reduce(
            blocked_range<float*>( array, array+n, grain_size ),
            0.f,
            [](const blocked_range<float*>& r, float value)->float {
                return std::accumulate(r.begin(),r.end(),value);
            },
            std::plus<float>()
        );
    }

D.4 Scalable Memory Pools

Memory pools allocate and free memory from a specified region or underlying allocator, providing thread-safe, scalable operations. Table 52 summarizes the memory pool concept. Here, P represents an instance of the memory pool class.

Table 52: Memory Pool Concept

    Pseudo-Signature                          Semantics
    ~P() throw();                             Destructor. Frees all the memory of allocated objects.
    void P::recycle();                        Frees all the memory of allocated objects.
    void* P::malloc(size_t n);                Returns pointer to n bytes allocated from memory pool.
    void P::free(void* ptr);                  Frees memory object specified via ptr pointer.
    void* P::realloc(void* ptr, size_t n);    Reallocates memory object pointed to by ptr to n bytes.

Model Types
Template class memory_pool (D.4.1) and class fixed_pool (D.4.2) model the Memory Pool concept.

D.4.1 memory_pool Template Class

Summary
Template class for scalable memory allocation from memory blocks provided by an underlying allocator.

CAUTION: If the underlying allocator refers to another scalable memory pool, the inner pool (or pools) must be destroyed before the outer pool is destroyed or recycled.

Syntax

    template <typename Alloc> class memory_pool;

Header

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"

Description
A memory_pool allocates and frees memory in a way that scales with the number of processors. The memory is obtained as big chunks from an underlying allocator specified by the template argument.
The latter must satisfy the subset of requirements described in Table 29 with allocate, deallocate, and value_type valid for sizeof(value_type)>0. A memory_pool models the Memory Pool concept described in Table 52.

Example

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"
    ...
    tbb::memory_pool<std::allocator<char> > my_pool;
    void* my_ptr = my_pool.malloc(10);
    my_pool.free(my_ptr);

The code above provides a simple example of allocation from an extensible memory pool.

Members

    namespace tbb {
    template <typename Alloc>
    class memory_pool : no_copy {
    public:
        memory_pool(const Alloc &src = Alloc()) throw(std::bad_alloc);
        ~memory_pool();
        void recycle();
        void *malloc(size_t size);
        void free(void* ptr);
        void *realloc(void* ptr, size_t size);
    };
    }

D.4.1.1 memory_pool(const Alloc &src = Alloc())

Effects
Constructs a memory pool with an instance of the underlying memory allocator of type Alloc copied from src. Throws a bad_alloc exception if the runtime fails to construct an instance of the class.

D.4.2 fixed_pool Class

Summary
Class for scalable memory allocation from a buffer of fixed size.

Syntax

    class fixed_pool;

Header

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"

Description
A fixed_pool allocates and frees memory in a way that scales with the number of processors. All the memory available for allocation is initially passed through arguments of the constructor. A fixed_pool models the Memory Pool concept described in Table 52.

Example

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"
    ...
    char buf[1024*1024];
    tbb::fixed_pool my_pool(buf, 1024*1024);
    void* my_ptr = my_pool.malloc(10);
    my_pool.free(my_ptr);

The code above provides a simple example of allocation from a fixed pool.
Members

    namespace tbb {
    class fixed_pool : no_copy {
    public:
        fixed_pool(void *buffer, size_t size) throw(std::bad_alloc);
        ~fixed_pool();
        void recycle();
        void *malloc(size_t size);
        void free(void* ptr);
        void *realloc(void* ptr, size_t size);
    };
    }

D.4.2.1 fixed_pool(void *buffer, size_t size)

Effects
Constructs a memory pool to manage the region of size bytes pointed to by buffer. Throws a bad_alloc exception if the runtime fails to construct an instance of the class.

D.4.3 memory_pool_allocator Template Class

Summary
Template class that provides the C++ allocator interface for memory pools.

Syntax

    template<typename T> class memory_pool_allocator;

Header

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"

Description
A memory_pool_allocator models the allocator requirements described in Table 29, except for the default constructor, which is excluded from the class. Instead, it provides a constructor that links with an instance of the memory_pool or fixed_pool class, which actually allocates and deallocates memory. The class is mainly intended to enable memory pools within STL containers.

Example

    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"
    ...
    typedef tbb::memory_pool_allocator<int> pool_allocator_t;
    std::list<int, pool_allocator_t> my_list( (pool_allocator_t( my_pool )) );

The code above provides a simple example of construction of a container that uses a memory pool.
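
To make the linkage between an allocator and its pool concrete, here is a minimal sketch that uses only the C++ standard library. ToyPool and ToyPoolAllocator are hypothetical stand-ins invented for illustration; they are not the Intel® TBB classes, ToyPool is not thread-safe, and the sketch assumes a C++11 compiler. Like memory_pool_allocator, the toy allocator has no default constructor and must be given a pool, to which every allocation is forwarded.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>

// Hypothetical stand-in for a pool: a bump allocator over a caller-supplied
// buffer. The buffer is assumed to be aligned for std::max_align_t.
class ToyPool {
public:
    ToyPool(void* buffer, std::size_t size)
        : base_(static_cast<char*>(buffer)), size_(size), used_(0), live_(0) {}
    void* malloc(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        std::size_t start = (used_ + a - 1) / a * a;  // round up to alignment
        if (start + n > size_) return nullptr;        // pool exhausted
        used_ = start + n;
        ++live_;
        return base_ + start;
    }
    void free(void*) { --live_; }   // a bump allocator only counts releases
    std::size_t live() const { return live_; }
private:
    char* base_;
    std::size_t size_, used_, live_;
};

// Hypothetical allocator with no default constructor: it must be linked to a pool.
template <typename T>
class ToyPoolAllocator {
public:
    typedef T value_type;
    explicit ToyPoolAllocator(ToyPool& pool) : pool_(&pool) {}
    // Converting constructor used when a container rebinds to its node type.
    template <typename U>
    ToyPoolAllocator(const ToyPoolAllocator<U>& other) : pool_(other.pool()) {}
    T* allocate(std::size_t n) {
        void* p = pool_->malloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { pool_->free(p); }
    ToyPool* pool() const { return pool_; }
private:
    ToyPool* pool_;
};

// Two allocators are interchangeable exactly when they share a pool.
template <typename T, typename U>
bool operator==(const ToyPoolAllocator<T>& a, const ToyPoolAllocator<U>& b) {
    return a.pool() == b.pool();
}
template <typename T, typename U>
bool operator!=(const ToyPoolAllocator<T>& a, const ToyPoolAllocator<U>& b) {
    return !(a == b);
}
```

A container is then constructed by passing an allocator instance, for example std::list<int, ToyPoolAllocator<int> > my_list((ToyPoolAllocator<int>(pool)));. The extra parentheses around the allocator argument keep the statement from being parsed as a function declaration.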
Members

    namespace tbb {
    template<typename T>
    class memory_pool_allocator {
    public:
        typedef T value_type;
        typedef value_type* pointer;
        typedef const value_type* const_pointer;
        typedef value_type& reference;
        typedef const value_type& const_reference;
        typedef size_t size_type;
        typedef ptrdiff_t difference_type;
        template<typename U> struct rebind {
            typedef memory_pool_allocator<U> other;
        };
        memory_pool_allocator(memory_pool &pool) throw();
        memory_pool_allocator(fixed_pool &pool) throw();
        memory_pool_allocator(const memory_pool_allocator& src) throw();
        template<typename U>
        memory_pool_allocator(const memory_pool_allocator<U>& src) throw();
        pointer address(reference x) const;
        const_pointer address(const_reference x) const;
        pointer allocate( size_type n, const void* hint=0);
        void deallocate( pointer p, size_type );
        size_type max_size() const throw();
        void construct( pointer p, const T& value );
        void destroy( pointer p );
    };

    template<>
    class memory_pool_allocator<void> {
    public:
        typedef void* pointer;
        typedef const void* const_pointer;
        typedef void value_type;
        template<typename U> struct rebind {
            typedef memory_pool_allocator<U> other;
        };
        memory_pool_allocator(memory_pool &pool) throw();
        memory_pool_allocator(fixed_pool &pool) throw();
        memory_pool_allocator(const memory_pool_allocator& src) throw();
        template<typename U>
        memory_pool_allocator(const memory_pool_allocator<U>& src) throw();
    };

    template<typename T, typename U>
    inline bool operator==( const memory_pool_allocator<T>& a,
                            const memory_pool_allocator<U>& b);
    template<typename T, typename U>
    inline bool operator!=( const memory_pool_allocator<T>& a,
                            const memory_pool_allocator<U>& b);
    }

D.4.3.1 memory_pool_allocator(memory_pool &pool)

Effects
Constructs a memory pool allocator serviced by the memory_pool instance pool.

D.4.3.2 memory_pool_allocator(fixed_pool &pool)

Effects
Constructs a memory pool allocator serviced by the fixed_pool instance pool.

D.5 Serial subset

Summary
A subset of the parallel algorithms is provided for modeling serial execution.
Currently only a serial version of tbb::parallel_for() is available.

D.5.1 tbb::serial::parallel_for()

Header

    #define TBB_PREVIEW_SERIAL_SUBSET 1
    #include "tbb/parallel_for.h"

Motivation
Sometimes it is useful, for example while debugging, to execute certain parallel_for() invocations serially while having other invocations of parallel_for() executed in parallel.

Description
The tbb::serial::parallel_for function implements the tbb::parallel_for API using a serial implementation underneath. Users who want sequential execution of a certain parallel_for() invocation will need to define the TBB_PREVIEW_SERIAL_SUBSET macro before parallel_for.h and prefix the selected parallel_for() with tbb::serial::.

Internally, the serial implementation uses the same principle of recursive decomposition, but instead of spawning tasks, it does recursion “for real”, i.e. the body function calls itself twice with two halves of its original range.

Example

    #define TBB_PREVIEW_SERIAL_SUBSET 1
    #include "tbb/parallel_for.h"

    void Foo() {
        // . . .
        tbb::serial::parallel_for( . . . );
        tbb::parallel_for( . . . );
        // . . .
    }

Intel® Threading Building Blocks Design Patterns

Document Number 323512-005US
World Wide Web: http://www.intel.com

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/#/en_US_01 0H . Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. 
logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Copyright (C) 2010 - 2011, Intel Corporation. All rights reserved.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Revision History

    Version   Version Information                     Date
    1.05      Updated the Optimization Notice.        2011-Oct-27
    1.04      Added Optimization Notice.              2011-Aug-1
    1.02      Correct lazy initialization examples.   2010-Sep-7
    1.01      Change enqueue_self to enqueue.         2010-May-25
    1.00      Initial version.                        2010-Apr-4

Contents

    1 Introduction
    2 Agglomeration
    3 Elementwise
    4 Odd-Even Communication
    5 Wavefront
    6 Reduction
    7 Divide and Conquer
    8 GUI Thread
    9 Non-Preemptive Priorities
    10 Local Serializer
    11 Fenced Data Transfer
    12 Lazy Initialization
    13 Reference Counting
    14 Compare and Swap Loop
    General References

1 Introduction

This document is a “cookbook” of some common parallel programming patterns and how to implement them in Intel® Threading Building Blocks (Intel® TBB). A cookbook will not make you a great chef, but it provides a collection of recipes that others have found useful.

Like most cookbooks, this document assumes that you know how to use basic tools. The Intel® Threading Building Blocks (Intel® TBB) Tutorial is a good place to learn the basic tools. This document is a guide to which tools to use when.

A design pattern description is much more than a rote coding recipe. The description of each pattern has the following format:
• Problem – describes the problem to be solved.
• Context – describes contexts in which the problem arises.
• Forces – considerations that drive use of the pattern.
• Solution – describes how to implement the pattern.
• Example – presents an example implementation.

Variations and examples are sometimes discussed. The code examples are intended to emphasize key points and are not full-fledged code. Examples may omit obvious const overloads of non-const methods.

Much of the nomenclature and examples are adapted from Web pages created by Eun-Gyu Kim and Marc Snir, and the Berkeley parallel patterns wiki. See links in the General References section.

For brevity, some of the code examples use C++0x lambda expressions. It is straightforward, albeit sometimes tedious, to translate such lambda expressions into equivalent C++98 code. See the section "Lambda Expressions" in the Intel® TBB tutorial on how to enable lambda expressions in the Intel® Compiler or how to do the translation by hand.

2 Agglomeration

Problem
Parallelism is so fine grained that the overhead of parallel scheduling or communication swamps the useful work.

Context
Many algorithms permit parallelism at a very fine grain, on the order of a few instructions per task. But synchronization between threads usually requires orders of magnitude more cycles.
For example, elementwise addition of two arrays can be done fully in parallel, but if each scalar addition is scheduled as a separate task, most of the time will be spent doing synchronization instead of useful addition.

Forces
• Individual computations can be done in parallel, but are small. For practical use of Intel® Threading Building Blocks (Intel® TBB), "small" here means less than 10,000 clock cycles.
• The parallelism is for the sake of performance and not required for semantic reasons.

Solution
Group the computations into blocks. Evaluate computations within a block serially.

The block size should be chosen to be large enough to amortize parallel overhead. Too large a block size may limit parallelism or load balancing because the number of blocks becomes too small to distribute work evenly across processors.

The choice of block topology is typically driven by two concerns:
• Minimizing synchronization between blocks.
• Minimizing cache traffic between blocks.

If the computations are completely independent, then the blocks will be independent too, and then only cache traffic issues must be considered.

If the loop is “small”, on the order of less than 10,000 clock cycles, then it may be impractical to parallelize at all, because the optimal agglomeration might be a single block.

Examples
Intel® TBB loop templates such as tbb::parallel_for that take a range argument support automatic agglomeration.

When agglomerating, think about cache effects. Avoid having cache lines cross between groups if possible.

There may be boundary to interior ratio effects. For example, if the computations form a 2D grid, and communicate only with nearest neighbors, then the computation per block grows quadratically (with the block’s area), but the cross-block communication grows linearly (with the block’s perimeter). Figure 1 shows four different ways to agglomerate an 8×8 grid.
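
The boundary-to-interior effect can be checked with a little arithmetic. The sketch below is illustrative only: block_cost and work_per_communication are hypothetical helpers, and the cost model (work proportional to a block's cell count, communication proportional to its perimeter measured in logical grid cells) is a simplifying assumption that ignores the grid's outer edge and cache line granularity.

```cpp
#include <cassert>

// Illustrative cost model (an assumption, not a measurement):
// work is the number of cells in a block, communication is its perimeter.
struct BlockCost {
    int work;           // width * height
    int communication;  // 2 * (width + height)
};

inline BlockCost block_cost(int width, int height) {
    assert(width > 0 && height > 0);
    BlockCost c = { width * height, 2 * (width + height) };
    return c;
}

// Work done per unit of cross-block communication; larger is better.
inline double work_per_communication(int width, int height) {
    BlockCost c = block_cost(width, height);
    return static_cast<double>(c.work) / c.communication;
}
```

Under this model, cutting an 8×8 grid into four 4×4 blocks gives 16 cells of work per 16 units of perimeter (ratio 1.0), while cutting it into four 8×2 strips gives the same 16 cells per 20 units of perimeter (ratio 0.8). Doubling a square block's side from 4 to 8 quadruples the work but only doubles the perimeter.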
If doing such analysis, be careful to consider that information is transferred in cache line units. For a given area, the perimeter may be minimized when the block is square with respect to the underlying grid of cache lines, not square with respect to the logical grid.

Figure 1: Four different agglomerations of an 8×8 grid.

Also consider vectorization. Blocks that contain long contiguous subsets of data may better enable vectorization.

For recursive computations, most of the work is towards the leaves, so the solution is to treat subtrees as groups, as shown in Figure 2.

Figure 2: Agglomeration of a recursive computation

Often such an agglomeration is achieved by recursing serially once some threshold is reached. For example, a recursive sort might solve sub-problems in parallel only if they are above a certain threshold size.

Reference
Ian Foster introduced the term "agglomeration" in his book Designing and Building Parallel Programs. There agglomeration is part of a four step “PCAM” design method:
1. Partitioning – break the program into the smallest tasks possible.
2. Communication – figure out what communication is required between tasks. When using Intel® TBB, communication is usually cache line transfers. Though they are automatic, understanding which ones happen between tasks helps guide the agglomeration step.
3. Agglomeration – combine tasks into larger tasks. His book has an extensive list of considerations that is worth reading.
4. Mapping – map tasks onto processors. The Intel® TBB task scheduler does this step for you.

3 Elementwise

Problem
Initiate similar independent computations across items in a data set, and wait until all complete.

Context
Many serial algorithms sweep over a set of items and do an independent computation on each item. However, if some kind of summary information is collected, use the Reduction pattern instead.
Forces
No information is carried or merged between the computations.

Solution
If the number of items is known in advance, use tbb::parallel_for. If not, consider using tbb::parallel_do.

Use agglomeration if the individual computations are small relative to scheduler overheads.

If the pattern is followed by a reduction on the same data, consider doing the elementwise operation as part of the reduction, so that the combination of the two patterns is accomplished in a single sweep instead of two sweeps. Doing so may improve performance by reducing traffic through the memory hierarchy.

Example
Convolution is often used in signal processing. The convolution of a filter c and signal x is computed as:

    \[ y_i = \sum_j c_j \, x_{i-j} \]

Serial code for this computation might look like:

    // Assumes c[0..clen-1] and x[1-clen..xlen-1] are defined
    for( int i=0; i<xlen+clen-1; ++i ) {
        float tmp = 0;
        for( int j=0; j<clen; ++j )
            tmp += c[j]*x[i-j];
        y[i] = tmp;
    }

[...]

    tbb::parallel_for(
        tbb::blocked_range<int>(0,xlen+clen-1,1000),
        [=]( tbb::blocked_range<int> r ) {
            int end = r.end();
            for( int i=r.begin(); i!=end; ++i ) {
                float tmp = 0;
                for( int j=0; j<clen; ++j )
                    tmp += c[j]*x[i-j];
                y[i] = tmp;
            }
        }
    );

[...] by Eun-Gyu Kim and Marc Snir describes the pattern.

5 Wavefront

Problem
Perform computations on items in a data set, where the computation on an item uses results from computations on predecessor items. See the Reference for a discussion.

Context
The dependences between computations form an acyclic graph.

Forces
• Dependence constraints between items form an acyclic graph.
• The number of immediate predecessors in the graph is known in advance, or can be determined some time before the last predecessor completes.

Solution
The solution is a parallel variant of topological sorting, using tbb::parallel_do to process items. Associate an atomic counter with each item. Initialize each counter to the number of predecessors. Invoke tbb::parallel_do to process the items that have no predecessors (have counts of zero). After an item is processed, decrement the counters of its successors.
If a successor's counter reaches zero, add that successor to the tbb::parallel_do via a "feeder".

If the number of predecessors for an item cannot be determined in advance, treat the information "known number of predecessors" as an additional predecessor. When the number of predecessors becomes known, treat this conceptual predecessor as completed.

If the overhead of counting individual items is excessive, aggregate items into blocks, and do the wavefront over the blocks.

Example
Below is a serial kernel for the longest common subsequence algorithm. The parameters are strings x and y with respective lengths xlen and ylen.

    int F[MAX_LEN+1][MAX_LEN+1];

    void SerialLCS( const char* x, size_t xlen, const char* y, size_t ylen )
    {
        for( size_t i=1; i<=xlen; ++i )
            for( size_t j=1; j<=ylen; ++j )
                F[i][j] = x[i-1]==y[j-1] ? F[i-1][j-1]+1 :
                                           max(F[i][j-1],F[i-1][j]);
    }

The kernel sets F[i][j] to the length of the longest common subsequence shared by x[0..i-1] and y[0..j-1]. It assumes that F[0][0..ylen] and F[0..xlen][0] have already been initialized to zero.

Figure 3 shows the data dependences for calculating F[i][j].

Figure 3: Data dependences for longest common subsequence calculation. (F[i][j] depends on F[i-1][j-1], F[i-1][j], and F[i][j-1].)

As Figure 4 shows, the gray diagonal dependence is the transitive closure of other dependences. Thus for parallelization purposes it is a redundant dependence that can be ignored.

Figure 4: Diagonal dependence is redundant.

It is generally good to remove redundant dependences from consideration, because the atomic counting incurs a cost for each dependence considered.

Another consideration is grain size. Scheduling each F[i][j] element calculation separately is prohibitively expensive. A good solution is to aggregate the elements into contiguous blocks, and process the contents of a block serially. The blocks have the same dependence pattern, but at a block scale.
Hence scheduling overheads can be amortized over blocks.

The parallel code follows. Each block consists of N×N elements. Each block has an associated atomic counter. Array Count organizes these counters for easy lookup. The code initializes the counters and then rolls a wavefront using parallel_do, starting with the block at the origin since it has no predecessors.

    const int N = 64;
    tbb::atomic<char> Count[MAX_LEN/N+1][MAX_LEN/N+1];

    void ParallelLCS( const char* x, size_t xlen, const char* y, size_t ylen ) {
        // Initialize predecessor counts for blocks.
        size_t m = (xlen+N-1)/N;
        size_t n = (ylen+N-1)/N;
        for( int i=0; i<m; ++i )
            for( int j=0; j<n; ++j )
                Count[i][j] = (i>0)+(j>0);
        // Roll the wavefront from the origin.
        typedef pair<size_t,size_t> block;
        block origin(0,0);
        tbb::parallel_do( &origin, &origin+1,
            [=]( const block& b, tbb::parallel_do_feeder<block>& feeder ) {
                // Extract bounds on block
                size_t bi = b.first;
                size_t bj = b.second;
                size_t xl = N*bi+1;
                size_t xu = min(xl+N,xlen+1);
                size_t yl = N*bj+1;
                size_t yu = min(yl+N,ylen+1);
                // Process the block
                for( size_t i=xl; i<xu; ++i )
                    for( size_t j=yl; j<yu; ++j )
                        F[i][j] = x[i-1]==y[j-1] ? F[i-1][j-1]+1 : max(F[i][j-1],F[i-1][j]);
                // Account for successors
                if( bj+1<n && --Count[bi][bj+1]==0 )
                    feeder.add( block(bi,bj+1) );
                if( bi+1<m && --Count[bi+1][bj]==0 )
                    feeder.add( block(bi+1,bj) );
            }
        );
    }

[...] by Eun-Gyu Kim and Marc Snir.

6 Reduction

Problem

Perform an associative reduction operation across a data set.

Context

Many serial algorithms sweep over a set of items to collect summary information.

Forces

The summary can be expressed as an associative operation over the data set, or at least is close enough to associative that reassociation does not matter.

Solution

Two solutions exist in Intel® Threading Building Blocks (Intel® TBB). The choice of which to use depends upon several considerations:

• Is the operation commutative as well as associative?
• Are instances of the reduction type expensive to construct and destroy? For example, a floating-point number is inexpensive to construct. A sparse floating-point matrix might be very expensive to construct.

Use tbb::parallel_reduce when the objects are inexpensive to construct.
It works even if the reduction operation is not commutative. The Intel® TBB Tutorial describes how to use tbb::parallel_reduce for basic reductions.

Use tbb::parallel_for and tbb::combinable if the reduction operation is commutative and instances of the type are expensive.

If the operation is not precisely associative but a precisely deterministic result is required, use recursive reduction and parallelize it using tbb::parallel_invoke.

Examples

The examples presented here illustrate the various solutions and some tradeoffs.

The first example uses tbb::parallel_reduce to do a + reduction over a sequence of type T. The sequence is defined by a half-open interval [first,last).

    T AssocReduce( const T* first, const T* last, T identity ) {
        return tbb::parallel_reduce(
            // Index range for reduction
            tbb::blocked_range<const T*>(first,last),
            // Identity element
            identity,
            // Reduce a subrange and partial sum
            [&]( tbb::blocked_range<const T*> r, T partial_sum )->T {
                return std::accumulate( r.begin(), r.end(), partial_sum );
            },
            // Reduce two partial sums
            std::plus<T>()
        );
    }

The third and fourth arguments to this form of parallel_reduce are a built-in form of the agglomeration pattern. If there is an elementwise action to be performed before the reduction, incorporating it into the third argument (reduction of a subrange) may improve performance because of better locality of reference.

The second example assumes that + is commutative on T. It is a good solution when T objects are expensive to construct.

    T CombineReduce( const T* first, const T* last, T identity ) {
        tbb::combinable<T> sum(identity);
        tbb::parallel_for(
            tbb::blocked_range<const T*>(first,last),
            [&]( tbb::blocked_range<const T*> r ) {
                sum.local() += std::accumulate(r.begin(), r.end(), identity);
            }
        );
        return sum.combine( []( const T& x, const T& y ) {return x+y;} );
    }

Sometimes it is desirable to destructively use the partial results to generate the final result.
For example, if the partial results are lists, they can be spliced together to form the final result. In that case use class tbb::enumerable_thread_specific instead of combinable. The ParallelFindCollisions example in Chapter 7 demonstrates the technique.

Floating-point addition and multiplication are almost associative. Reassociation can cause changes because of rounding effects. The techniques shown so far reassociate terms non-deterministically. Fully deterministic parallel reduction for a not quite associative operation requires using deterministic reassociation. The code below demonstrates this in the form of a template that does a + reduction over a sequence of values of type T.

    template<typename T>
    T RepeatableReduce( const T* first, const T* last, T identity ) {
        if( last-first<=1000 ) {
            // Use serial reduction
            return std::accumulate( first, last, identity );
        } else {
            // Do parallel divide-and-conquer reduction
            const T* mid = first+(last-first)/2;
            T left, right;
            tbb::parallel_invoke(
                [&]{left=RepeatableReduce(first,mid,identity);},
                [&]{right=RepeatableReduce(mid,last,identity);}
            );
            return left+right;
        }
    }

The outer if-else is an instance of the agglomeration pattern for recursive computations. The reduction graph, though not a strict binary tree, is fully deterministic. Thus the result will always be the same for a given input sequence, assuming all threads do identical floating-point rounding.

The final example shows how a problem that typically is not viewed as a reduction can be parallelized by viewing it as a reduction. The problem is retrieving floating-point exception flags for a computation across a data set.
The serial code might look something like:

    feclearexcept(FE_ALL_EXCEPT);
    for( int i=0; i<N; ++i )
        C[i] = A[i]/B[i];
    int flags = fetestexcept(FE_ALL_EXCEPT);

[...]

    struct ComputeChunk {
        int flags;   // Floating-point exceptions seen so far.
        ComputeChunk() : flags(0) { feclearexcept(FE_ALL_EXCEPT); }
        // Splitting constructor, called by parallel_reduce when it splits a range.
        ComputeChunk( ComputeChunk&, tbb::split ) : flags(0) { feclearexcept(FE_ALL_EXCEPT); }
        // Operates on a chunk and collects floating-point exception state into flags.
        void operator()( tbb::blocked_range<int> r ) {
            int end=r.end();
            for( int i=r.begin(); i!=end; ++i )
                C[i] = A[i]/B[i];
            // It is critical to do |= here, not =, because otherwise we
            // might lose earlier exceptions from the same thread.
            flags |= fetestexcept(FE_ALL_EXCEPT);
        }
        // Called by parallel_reduce when joining results from two subranges.
        void join( ComputeChunk& other ) {
            flags |= other.flags;
        }
    };

Then invoke it as follows:

    // Construction of cc implicitly resets FP exception state.
    ComputeChunk cc;
    tbb::parallel_reduce( tbb::blocked_range<int>(0,N), cc );
    if (cc.flags & FE_DIVBYZERO) ...;
    if (cc.flags & FE_OVERFLOW) ...;
    ...

7 Divide and Conquer

Problem

Parallelize a divide and conquer algorithm.

Context

Divide and conquer is widely used in serial algorithms. Common examples are quicksort and mergesort.

Forces

• Problem can be transformed into subproblems that can be solved independently.
• Splitting problem or merging solutions is relatively cheap compared to cost of solving the subproblems.

Solution

There are several ways to implement divide and conquer in Intel® Threading Building Blocks (Intel® TBB). The best choice depends upon circumstances.

• If division always yields the same number of subproblems, use recursion and tbb::parallel_invoke.
• If the number of subproblems varies, use recursion and tbb::task_group.
• If ultimate efficiency and scalability is important, use tbb::task and continuation passing style.

Example

Quicksort is a classic divide-and-conquer algorithm. It divides a sorting problem into two subsorts. A simple serial version looks like: [1]

    void SerialQuicksort( T* begin, T* end ) {
        if( end-begin>1 ) {
            using namespace std;
            T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
            swap( *begin, mid[-1] );
            SerialQuicksort( begin, mid-1 );
            SerialQuicksort( mid, end );
        }
    }

[1] Production quality quicksort implementations typically use more sophisticated pivot selection, explicit stacks instead of recursion, and some other sorting algorithm for small subsorts. The simple algorithm is used here to focus on exposition of the parallel pattern.

The number of subsorts is fixed at two, so tbb::parallel_invoke provides a simple way to parallelize it. The parallel code is shown below:

    void ParallelQuicksort( T* begin, T* end ) {
        if( end-begin>1 ) {
            using namespace std;
            T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
            swap( *begin, mid[-1] );
            tbb::parallel_invoke(
                [=]{ParallelQuicksort( begin, mid-1 );},
                [=]{ParallelQuicksort( mid, end );}
            );
        }
    }

Eventually the subsorts become small enough that serial execution is more efficient. The following variation, which changes the if condition and adds an else branch, does sorts of fewer than 500 elements using the earlier serial code.

    void ParallelQuicksort( T* begin, T* end ) {
        if( end-begin>=500 ) {
            using namespace std;
            T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
            swap( *begin, mid[-1] );
            tbb::parallel_invoke(
                [=]{ParallelQuicksort( begin, mid-1 );},
                [=]{ParallelQuicksort( mid, end );}
            );
        } else {
            SerialQuicksort( begin, end );
        }
    }

The change is an instance of the Agglomeration pattern.

The next example considers a problem where there are a variable number of subproblems. The problem involves a tree-like description of a mechanical assembly. There are two kinds of nodes:

• Leaf nodes represent individual parts.
• Internal nodes represent groups of parts.

The problem is to find all nodes that collide with a target node. The following code shows a serial solution that walks the tree. It records in Hits any nodes that collide with Target.
    std::list<Node*> Hits;
    Node* Target;

    void SerialFindCollisions( Node& x ) {
        if( x.is_leaf() ) {
            if( x.collides_with( *Target ) )
                Hits.push_back(&x);
        } else {
            for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                SerialFindCollisions(*y);
        }
    }

A parallel version is shown below.

    typedef tbb::enumerable_thread_specific<std::list<Node*> > LocalList;
    LocalList LocalHits;
    Node* Target; // Target node

    void ParallelWalk( Node& x ) {
        if( x.is_leaf() ) {
            if( x.collides_with( *Target ) )
                LocalHits.local().push_back(&x);
        } else {
            // Recurse on each child y of x in parallel
            tbb::task_group g;
            for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
                g.run( [=]{ParallelWalk(*y);} );
            // Wait for recursive calls to complete
            g.wait();
        }
    }

    void ParallelFindCollisions( Node& x ) {
        ParallelWalk(x);
        for(LocalList::iterator i=LocalHits.begin(); i!=LocalHits.end(); ++i)
            Hits.splice( Hits.end(), *i );
    }

The recursive walk is parallelized using class task_group to do recursive calls in parallel.

There is another significant change because of the parallelism that is introduced. Because it would be unsafe to update Hits concurrently, the parallel walk uses variable LocalHits to accumulate results. Because it is of type enumerable_thread_specific, each thread accumulates its own private result. The results are spliced together into Hits after the walk completes.

The results will not be in the same order as the original serial code.

If parallel overhead is high, use the agglomeration pattern. For example, use the serial walk for subtrees under a certain threshold.

8 GUI Thread

Problem

A user interface thread must remain responsive to user requests, and must not get bogged down in long computations.

Context

Graphical user interfaces often have a dedicated thread (“GUI thread”) for servicing user interactions.
The thread must remain responsive to user requests even while the application has long computations running. For example, the user might want to press a “cancel” button to stop the long running computation. If the GUI thread takes part in the long running computation, it will not be able to respond to user requests.

Forces

• The GUI thread services an event loop.
• The GUI thread needs to offload work onto other threads without waiting for the work to complete.
• The GUI thread must be responsive to the event loop and not become dedicated to doing the offloaded work.

Related

• Non-Preemptive Priorities
• Local Serializer

Solution

The GUI thread offloads the work by firing off a task to do it using method task::enqueue. When finished, the task posts an event to the GUI thread to indicate that the work is done. The semantics of enqueue cause the task to eventually run on a worker thread distinct from the calling thread. The method is a new feature in Intel® Threading Building Blocks (Intel® TBB) 3.0.

Figure 5 sketches the communication paths.

[Figure 5: GUI Thread pattern. The GUI thread runs the message loop and calls task::enqueue; another thread runs task::execute and posts the event back to the GUI thread.]

Example

The example is for the Microsoft Windows* operating systems, though similar principles apply to any GUI using an event loop idiom. For each event, the GUI thread calls a user-defined function WndProc to process the event.

    // Event posted from enqueued task when it finishes its work.
    const UINT WM_POP_FOO = WM_USER+0;

    // Queue for transmitting results from enqueued task to GUI thread.
    tbb::concurrent_queue<Foo> ResultQueue;

    // GUI thread’s private copy of most recently computed result.
    Foo CurrentResult;

    LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam) {
        switch(msg) {
            case WM_COMMAND:
                switch (LOWORD(wParam)) {
                    case IDM_LONGRUNNINGWORK:
                        // User requested a long computation. Delegate it to another thread.
                        LaunchLongRunningWork(hWnd);
                        break;
                    case IDM_EXIT:
                        DestroyWindow(hWnd);
                        break;
                    default:
                        return DefWindowProc(hWnd, msg, wParam, lParam);
                }
                break;
            case WM_POP_FOO:
                // There is another result in ResultQueue for me to grab.
                ResultQueue.try_pop(CurrentResult);
                // Update the window with the latest result.
                RedrawWindow( hWnd, NULL, NULL, RDW_ERASE|RDW_INVALIDATE );
                break;
            case WM_PAINT:
                // Repaint the window using CurrentResult
                break;
            case WM_DESTROY:
                PostQuitMessage(0);
                break;
            default:
                return DefWindowProc( hWnd, msg, wParam, lParam );
        }
        return 0;
    }

The GUI thread processes long computations as follows:

1. The GUI thread calls LaunchLongRunningWork, which hands off the work to a worker thread and immediately returns.
2. The GUI thread continues servicing the event loop. If it has to repaint the window, it uses the value of CurrentResult, which is the most recent Foo that it has seen. When a worker finishes the long computation, it pushes the result into ResultQueue, and sends a message WM_POP_FOO to the GUI thread.
3. The GUI thread services a WM_POP_FOO message by popping an item from ResultQueue into CurrentResult. The try_pop always succeeds because there is exactly one WM_POP_FOO message for each item in ResultQueue.

Routine LaunchLongRunningWork creates a root task and launches it using method task::enqueue. The task is a root task because it has no successor task waiting on it.

    class LongTask: public tbb::task {
        HWND hWnd;
        tbb::task* execute() {
            // Do long computation
            // Foo x = result of long computation
            ResultQueue.push( x );
            // Notify GUI thread that result is available.
            PostMessage(hWnd,WM_POP_FOO,0,0);
            return NULL;
        }
    public:
        LongTask( HWND hWnd_ ) : hWnd(hWnd_) {}
    };

    void LaunchLongRunningWork( HWND hWnd ) {
        LongTask* t = new( tbb::task::allocate_root() ) LongTask(hWnd);
        tbb::task::enqueue(*t);
    }

It is essential to use method task::enqueue and not method task::spawn. The reason is that method enqueue ensures that the task eventually executes when resources permit, even if no thread explicitly waits on the task. In contrast, method spawn may postpone execution of the task until it is explicitly waited upon.

The example uses a concurrent_queue for workers to communicate results back to the GUI thread. Since only the most recent result matters in the example, an alternative would be to use a shared variable protected by a mutex. However, doing so would block the worker while the GUI thread was holding a lock on the mutex, and vice versa. Using concurrent_queue provides a simple robust solution.

If two long computations are in flight, there is a chance that the first computation completes after the second one. If displaying the result of the most recently requested computation is important, then associate a request serial number with the computation. The GUI thread can pop from ResultQueue into a temporary variable, check the serial number, and update CurrentResult only if doing so advances the serial number.

See Non-Preemptive Priorities for how to implement priorities. See Local Serializer for how to force serial ordering of certain tasks.

9 Non-Preemptive Priorities

Problem

Choose the next work item to do, based on priorities.

Context

The scheduler in Intel® Threading Building Blocks (Intel® TBB) chooses tasks using rules based on scalability concerns. The rules are based on the order in which tasks were spawned or enqueued, and are oblivious to the contents of tasks.
However, sometimes it is best to choose work based on some kind of priority relationship.

Forces

• Given multiple work items, there is a rule for which item should be done next that is not the default Intel® TBB rule.
• Preemptive priorities are not necessary. If a higher priority item appears, it is not necessary to immediately stop lower priority items in flight. If preemptive priorities are necessary, then non-preemptive tasking is inappropriate. Use threads instead.

Solution

Put the work in a shared work pile. Decouple tasks from specific work, so that task execution chooses the actual piece of work to be selected from the pile.

Example

The following example implements three priority levels. The user interface for it and top-level implementation follow:

    enum Priority {
        P_High,
        P_Medium,
        P_Low
    };

    template<typename Func>
    void EnqueueWork( Priority p, Func f ) {
        WorkItem* item = new ConcreteWorkItem<Func>( p, f );
        ReadyPile.add(item);
    }

The caller provides a priority p and a functor f to routine EnqueueWork. The functor may be the result of a lambda expression. EnqueueWork packages f as a WorkItem and adds it to global object ReadyPile.

Class WorkItem provides a uniform interface for running functors of unknown type:

    // Abstract base class for a prioritized piece of work.
    class WorkItem {
    public:
        WorkItem( Priority p ) : priority(p) {}
        // Derived class defines the actual work.
        virtual void run() = 0;
        const Priority priority;
    };

    template<typename Func>
    class ConcreteWorkItem: public WorkItem {
        Func f;
        /*override*/ void run() {
            f();
            delete this;
        }
    public:
        ConcreteWorkItem( Priority p, const Func& f_ ) : WorkItem(p), f(f_) {}
    };

Class ReadyPileType contains the core pattern.
It maintains a collection of work and fires off tasks that choose work from the collection:

    class ReadyPileType {
        // One queue for each priority level
        tbb::concurrent_queue<WorkItem*> level[P_Low+1];
    public:
        void add( WorkItem* item ) {
            level[item->priority].push(item);
            tbb::task::enqueue(*new(tbb::task::allocate_root()) RunWorkItem);
        }
        void runNextWorkItem() {
            // Scan queues in priority order for an item.
            WorkItem* item=NULL;
            for( int i=P_High; i<=P_Low; ++i )
                if( level[i].try_pop(item) )
                    break;
            assert(item);
            item->run();
        }
    };

    ReadyPileType ReadyPile;

The task enqueued by add(item) does not necessarily execute that item. The task executes runNextWorkItem(), which may find a higher priority item. There is one task for each item, but the mapping resolves when the task actually executes, not when it is created.

Here are the details of class RunWorkItem:

    class RunWorkItem: public tbb::task {
        /*override*/tbb::task* execute(); // Private override of virtual method
    };
    ...
    tbb::task* RunWorkItem::execute() {
        ReadyPile.runNextWorkItem();
        return NULL;
    }

RunWorkItem objects are fungible. They enable the Intel® TBB scheduler to choose when to do a work item, not which work item to do. The override of virtual method task::execute is private because all calls to it are dispatched via base class task.

Other priority schemes can be implemented by changing the internals for ReadyPileType. A priority queue could be used to implement very fine grained priorities.

The scalability of the pattern is limited by the scalability of ReadyPileType. Ideally scalable concurrent containers should be used for it.

10 Local Serializer

Context

Consider an interactive program. To maximize concurrency and responsiveness, operations requested by the user can be implemented as tasks. The order of operations can be important. For example, suppose the program presents editable text to the user.
There might be operations to select text and delete selected text. Reversing the order of “select” and “delete” operations on the same buffer would be bad. However, commuting operations on different buffers might be okay. Hence the goal is to establish serial ordering of tasks associated with a given object, but not constrain ordering of tasks between different objects.

Forces

• Operations associated with a certain object must be performed in serial order.
• Serializing with a lock would be wasteful because threads would be waiting at the lock when they could be doing useful work elsewhere.

Solution

Sequence the work items using a FIFO (first-in first-out) structure. Always keep an item in flight if possible. If no item is in flight when a work item appears, put the item in flight. Otherwise, push the item onto the FIFO. When the current item in flight completes, pop another item from the FIFO and put it in flight.

The logic can be implemented without mutexes, by using concurrent_queue for the FIFO and atomic<int> to count the number of items waiting and in flight. The example explains the accounting in detail.

Example

The following example builds on the Non-Preemptive Priorities example to implement local serialization in addition to priorities. It implements three priority levels and local serializers. The user interface for it follows:

    enum Priority {
        P_High,
        P_Medium,
        P_Low
    };

    template<typename Func>
    void EnqueueWork( Priority p, Func f, Serializer* s=NULL );

Template function EnqueueWork causes functor f to run when the three constraints in Table 1 are met.

Table 1: Implementation of Constraints

    Constraint                                         Resolved by class...
    Any prior work for the Serializer has completed.   Serializer
    A thread is available.                             RunWorkItem
    No higher priority work is ready to run.           ReadyPileType

Constraints on a given functor are resolved from top to bottom in the table. The first constraint does not exist when s is NULL.
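
The FIFO-plus-counter accounting described in the Solution can be tried outside Intel® TBB. The following is a minimal sketch in standard C++11, not the Serializer class of this example: the names Work, LockedFifo, MiniSerializer, and run_demo are hypothetical, a mutex-protected std::queue stands in for tbb::concurrent_queue, and items run directly on the calling thread instead of being routed through a ready pile.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

typedef std::function<void()> Work;

// Hypothetical stand-in for tbb::concurrent_queue: a mutex-protected FIFO.
class LockedFifo {
    std::mutex m;
    std::queue<Work> q;
public:
    void push(Work w) {
        std::lock_guard<std::mutex> guard(m);
        q.push(std::move(w));
    }
    Work pop() {
        std::lock_guard<std::mutex> guard(m);
        assert(!q.empty());   // guaranteed by the count protocol below
        Work w = std::move(q.front());
        q.pop();
        return w;
    }
};

class MiniSerializer {
    LockedFifo fifo;
    std::atomic<int> count;   // queued items plus the in-flight item
public:
    MiniSerializer() : count(0) {}
    void add(Work w) {
        fifo.push(std::move(w));
        // A 0 -> 1 transition means nothing is in flight, so this caller
        // becomes responsible for running items.
        if (++count == 1) {
            do {
                Work next = fifo.pop();
                next();
            } while (--count != 0);   // nonzero: more items are waiting
        }
    }
};

// Items added while another item is in flight still run in FIFO order.
std::vector<int> run_demo() {
    MiniSerializer s;
    std::vector<int> order;
    s.add([&] {
        order.push_back(1);
        s.add([&] { order.push_back(2); });   // queued, not run re-entrantly
        s.add([&] { order.push_back(3); });
        order.push_back(10);                  // still inside the first item
    });
    return order;
}
```

Pushing before incrementing the counter is what makes the pop safe: whenever the counter is observed to be nonzero after a decrement, the corresponding item is already in the FIFO.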
The implementation of EnqueueWork packages the functor in a SerializedWorkItem and routes it to the class that enforces the first relevant constraint between pieces of work.

    template<typename Func>
    void EnqueueWork( Priority p, Func f, Serializer* s=NULL ) {
        WorkItem* item = new SerializedWorkItem<Func>( p, f, s );
        if( s )
            s->add(item);
        else
            ReadyPile.add(item);
    }

A SerializedWorkItem is derived from a WorkItem, which serves as a way to pass around a prioritized piece of work without knowing further details of the work.

    // Abstract base class for a prioritized piece of work.
    class WorkItem {
    public:
        WorkItem( Priority p ) : priority(p) {}
        // Derived class defines the actual work.
        virtual void run() = 0;
        const Priority priority;
    };

    template<typename Func>
    class SerializedWorkItem: public WorkItem {
        Serializer* serializer;
        Func f;
        /*override*/ void run() {
            f();
            Serializer* s = serializer;
            // Destroy f before running Serializer’s next functor.
            delete this;
            if( s )
                s->noteCompletion();
        }
    public:
        SerializedWorkItem( Priority p, const Func& f_, Serializer* s ) :
            WorkItem(p), serializer(s), f(f_)
        {}
    };

Base class WorkItem is the same as class WorkItem in the example for Non-Preemptive Priorities. The notion of serial constraints is completely hidden from the base class, thus permitting the framework to be extended to other kinds of constraints or lack of constraints. Class SerializedWorkItem is essentially ConcreteWorkItem from the other example, extended with a Serializer aspect.

Virtual method run() is invoked when it becomes time to run the functor. It performs three steps:

1. Run the functor.
2. Destroy the functor.
3. Notify the Serializer that the functor completed, thus unconstraining the next waiting functor.

Step 3 is the difference from the operation of ConcreteWorkItem::run. Step 2 could be done after step 3 in some contexts to increase concurrency slightly.
However, the presented order is recommended because if step 2 takes non-trivial time, it likely has side effects that should complete before the next functor runs.

Class Serializer implements the core of the Local Serializer pattern:

    class Serializer {
        tbb::concurrent_queue<WorkItem*> queue;
        tbb::atomic<int> count; // Count of queued items and in-flight item
        void moveOneItemToReadyPile() { // Transfer item from queue to ReadyPile
            WorkItem* item;
            queue.try_pop(item);
            ReadyPile.add(item);
        }
    public:
        void add( WorkItem* item ) {
            queue.push(item);
            if( ++count==1 )
                moveOneItemToReadyPile();
        }
        void noteCompletion() { // Called when WorkItem completes.
            if( --count!=0 )
                moveOneItemToReadyPile();
        }
    };

The class maintains two members:

• A queue of WorkItem waiting for prior work to complete.
• A count of queued or in-flight work.

Mutexes are avoided by using concurrent_queue and atomic<int> along with careful ordering of operations. The transitions of count are the key to understanding how class Serializer works.

• If method add increments count from 0 to 1, this indicates that no other work is in flight and thus the work should be moved to the ReadyPile.
• If method noteCompletion decrements count and it is not from 1 to 0, then the queue is non-empty and another item in the queue should be moved to ReadyPile.

Class ReadyPileType is explained in the example for Non-Preemptive Priorities.

If priorities are not necessary, there are two variations on method moveOneItemToReadyPile, with different implications.

• Method moveOneItemToReadyPile could directly invoke item->run(). This approach has relatively low overhead and high thread locality for a given Serializer. But it is unfair. If the Serializer has a continual stream of tasks, the thread operating on it will keep servicing those tasks to the exclusion of others.
• Method moveOneItemToReadyPile could invoke task::enqueue to enqueue a task that invokes item->run().
Doing so introduces higher overhead and less locality than the first approach, but avoids starvation.

The conflict between fairness and maximum locality is fundamental. The best resolution depends upon circumstance.

The pattern generalizes to constraints on work items more general than those maintained by class Serializer. A generalized Serializer::add determines if a work item is unconstrained, and if so, runs it immediately. A generalized Serializer::noteCompletion runs all previously constrained items that have become unconstrained by the completion of the current work item. The term “run” means to run work immediately, or if there are more constraints, to forward the work to the next constraint resolver.

11 Fenced Data Transfer

Problem

Write a message to memory and have another processor read it on hardware that does not have a sequentially consistent memory model.

Context

The problem normally arises only when unsynchronized threads concurrently act on a memory location, or are using reads and writes to create synchronization. High level synchronization constructs normally include mechanisms that prevent unwanted reordering.

Modern hardware and compilers can reorder memory operations in a way that preserves the order of a thread's operation from its viewpoint, but not as observed by other threads. A common serial idiom is to write a message and mark it as ready to read, as shown in the following code:

    bool Ready;
    std::string Message;

    void Send( const std::string& src ) { // Executed by thread 1
        Message=src;
        Ready = true;
    }

    bool Receive( std::string& dst ) { // Executed by thread 2
        bool result = Ready;
        if( result ) dst=Message;
        return result; // Return true if message was received.
    }

Two key assumptions of the code are:

a. Ready does not become true until Message is written.
b. Message is not read until Ready becomes true.

These assumptions are trivially true on uniprocessor hardware.
However, they may break on multiprocessor hardware. Reordering by the hardware or compiler can cause the sender's writes to appear out of order to the receiver (thus breaking condition a) or the receiver's reads to appear out of order (thus breaking condition b).

Forces

• Creating synchronization via raw reads and writes.

Related

• Lazy Initialization

Solution

Change the type of the flag that indicates when the message is ready from bool to tbb::atomic<bool>. Here is the previous example with that one change, in the declaration of Ready.

    tbb::atomic<bool> Ready;
    std::string Message;

    void Send( const std::string& src ) { // Executed by thread 1
        Message=src;
        Ready = true;
    }

    bool Receive( std::string& dst ) { // Executed by thread 2
        bool result = Ready;
        if( result ) dst=Message;
        return result; // Return true if message was received.
    }

A write to a tbb::atomic value has release semantics, which means that all of the thread's prior writes will be seen before the releasing write. A read from a tbb::atomic value has acquire semantics, which means that all of the thread's subsequent reads will happen after the acquiring read. The implementation of tbb::atomic ensures that both the compiler and the hardware observe these ordering constraints.

Variations

Higher level synchronization constructs normally include the necessary acquire and release fences. For example, mutexes are normally implemented such that acquisition of a lock has acquire semantics and release of a lock has release semantics. Thus a thread that acquires a lock on a mutex always sees any memory writes done by another thread before it released a lock on that mutex.

Non Solutions

Mistaken solutions are so often proposed that it is worth understanding why they are wrong.

One common mistake is to assume that declaring the flag with the volatile keyword solves the problem.
Though the volatile keyword forces a write to happen immediately, it generally has no effect on the visible ordering of that write with respect to other memory operations. Processors from the Intel® Itanium® processor family are an exception to this rule: by convention they assign acquire semantics to volatile reads and release semantics to volatile writes.

Another mistake is to assume that conditionally executed code cannot happen before the condition is tested. However, the compiler or hardware may speculatively hoist the conditional code above the condition.

Similarly, it is a mistake to assume that a processor cannot read the target of a pointer before reading the pointer. A modern processor does not read individual values from main memory. It reads cache lines. The target of a pointer may be in a cache line that has already been read before the pointer was read, thus giving the appearance that the processor presciently read the pointer target.

12 Lazy Initialization

Problem

Perform an initialization the first time it is needed.

Context

Initializing data structures lazily is a common technique. Not only does it avoid the cost of initializing unused data structures, it is often a more convenient way to structure a program.

Forces

• Threads share access to an object.
• The object should not be created until the first access.

The second force covers several possible motivations:

• The object is expensive to create and creating it early would slow down program startup.
• It is not used in every run of the program.
• Early initialization would require adding code where it is undesirable for readability or structural reasons.

Related

• Fenced Data Transfer

Solutions

A parallel solution is substantially trickier, because it must deal with several concurrency issues.
Races: If two threads attempt to access the object simultaneously for the first time, and thus cause creation of the object, the race must be resolved in such a way that both threads end up with a reference to the same object of type T.
Memory leaks: In the event of a race, the implementation must ensure that any extra transient T objects are cleaned up.
Memory consistency: If thread X executes value=new T(), all other threads must see the stores done by new T() occur before the assignment to value.
Deadlock: What if the constructor of T() requires acquiring a lock, but the current holder of that lock is also racing to access the object for the first time?

There are two solutions. One is based on double-check locking. The other relies on compare-and-swap. Because the tradeoffs and issues are subtle, most of the discussion is in the following examples section.

Examples
An Intel® TBB implementation of the “double-check” pattern is shown below:

    template<typename T, typename Mutex>
    class lazy {
        tbb::atomic<T*> value;
        Mutex mut;
    public:
        lazy() : value() {}                    // Initializes value to NULL
        ~lazy() {delete value;}
        T& get() {
            if( !value ) {                     // Read of value has acquire semantics.
                typename Mutex::scoped_lock lock(mut);
                if( !value ) value = new T();  // Write of value has release semantics
            }
            return *value;
        }
    };

The name comes from the way that the pattern deals with races. There is one check done without locking and one check done after locking. The first check handles the presumably common case that the initialization has already been done, without any locking. The second check deals with cases where two threads both see an uninitialized value, and both try to acquire the lock. In that case, the second thread to acquire the lock will see that the initialization has already occurred.

If T() throws an exception, the solution is correct because value will still be NULL and the mutex is unlocked when the object lock is destroyed.

The solution correctly addresses memory consistency issues.
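For readers without Intel® TBB at hand, the double-check pattern described above can be sketched with the C++11 standard library alone. This is a minimal sketch under stated assumptions, not the TBB implementation: the names std_lazy and demo are hypothetical, std::atomic<T*> stands in for tbb::atomic<T*>, and std::mutex stands in for the Mutex template parameter.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical standard-library analogue of the lazy class (assumption: C++11 or later).
template<typename T>
class std_lazy {
    std::atomic<T*> value;
    std::mutex mut;
public:
    std_lazy() : value(nullptr) {}
    ~std_lazy() { delete value.load(std::memory_order_relaxed); }
    std_lazy(const std_lazy&) = delete;
    std_lazy& operator=(const std_lazy&) = delete;

    T& get() {
        // First check, done without the lock. The acquire load pairs with
        // the release store below, so the constructed T is fully visible.
        T* p = value.load(std::memory_order_acquire);
        if (!p) {
            std::lock_guard<std::mutex> lock(mut);
            // Second check, done under the lock: another thread may have
            // completed the initialization while this one waited.
            p = value.load(std::memory_order_relaxed);
            if (!p) {
                p = new T();  // If this throws, value stays null and the lock is released.
                value.store(p, std::memory_order_release);
            }
        }
        return *p;
    }
};

// Usage sketch: repeated calls return the same lazily created object.
inline void demo() {
    std_lazy<int> counter;
    int& first = counter.get();
    int& second = counter.get();
    assert(&first == &second);  // same object both times
    assert(first == 0);         // new int() value-initializes to zero
}
```

The explicit acquire and release orderings make visible the two fences that the pattern depends on; using the default sequentially consistent operations would also be correct, at a possibly higher cost.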
A write to a tbb::atomic value has release semantics, which means that all of the writing thread's prior writes will be seen before the releasing write. A read from a tbb::atomic value has acquire semantics, which means that all of the reading thread's subsequent reads will happen after the acquiring read. Both of these properties are critical to the solution. The releasing write ensures that the construction of T() is seen to occur before the assignment to value. The acquiring read ensures that when the caller reads from *value, the reads occur after the "if(!value)" check. The release/acquire pairing is essentially the Fenced Data Transfer pattern, where the “message” is the fully constructed instance T(), and the “ready” flag is the pointer value.

The solution described involves blocking threads while initialization occurs. Hence it can suffer the usual pathologies associated with blocking. For example, if the thread that first acquires the lock is suspended by the OS, all other threads will have to wait until that thread resumes. A lock-free variation avoids this problem by making all contending threads attempt initialization, and atomically deciding which attempt succeeds.

An Intel® TBB implementation of the non-blocking variant follows. It also uses double-check, but without a lock.

    template<typename T>
    class lazy {
        tbb::atomic<T*> value;
    public:
        lazy() : value() {}     // Initializes value to NULL
        ~lazy() {delete value;}
        T& get() {
            if( !value ) {
                T* tmp = new T();
                if( value.compare_and_swap(tmp,NULL)!=NULL )
                    // Another thread installed the value, so throw away mine.
                    delete tmp;
            }
            return *value;
        }
    };

The second check is performed by the expression value.compare_and_swap(tmp,NULL)!=NULL, which conditionally assigns value=tmp if value==NULL, and evaluates to true if the old value was not NULL, that is, if another thread had already installed its object. Thus if multiple threads attempt simultaneous initialization, the first thread to execute the compare_and_swap will set value to point to its T object.
Other contenders that execute the compare_and_swap will get back a non-NULL pointer, and know that they should delete their transient T objects.

As with the locking solution, memory consistency issues are addressed by the semantics of tbb::atomic. The first check has acquire semantics and the compare_and_swap has both acquire and release semantics.

Reference
A sophisticated way to avoid the acquire fence for a read is Mike Burrows' algorithm.

13 Reference Counting

Problem
Destroy an object when it will no longer be used.

Context
Often it is desirable to destroy an object when it is known that it will not be used in the future. Reference counting is a common serial solution that extends to parallel programming if done carefully.

Forces
• If there are cycles of references, basic reference counting is insufficient unless the cycle is explicitly broken.
• Atomic counting is relatively expensive in hardware.

Solution
Thread-safe reference counting is like serial reference counting, except that the increment/decrement is done atomically, and the decrement and test "count is zero?" must act as a single atomic operation. The following example uses tbb::atomic to achieve this.

    template<typename T>
    class counted {
        tbb::atomic<size_t> my_count;
        T my_value;
    public:
        // Construct object with a single reference to it.
        counted() {my_count=1;}
        // Add reference
        void add_ref() {++my_count;}
        // Remove reference. Return true if it was the last reference.
        bool remove_ref() {return --my_count==0;}
        // Get reference to underlying object
        T& get() {
            assert(my_count>0);
            return my_value;
        }
    };

It is incorrect to use a separate read for testing if the count is zero. The following code would be an incorrect implementation of method remove_ref() because two threads might both execute the decrement, and then both read my_count as zero.
Hence two callers would both be told incorrectly that they had removed the last reference.

    --my_count;
    return my_count==0; // WRONG!

The decrement may need to have a release fence so that any pending writes complete before the object is deleted.

There is no simple way to atomically copy a pointer and increment its reference count, because there will be a timing hole between the copying and the increment where the reference count is too low, and thus another thread might decrement the count to zero and delete the object. Two ways to address the problem are “hazard pointers” and “pass the buck”. See the references at the end of this chapter for details.

Variations
Atomic increment/decrement can be more than an order of magnitude more expensive than ordinary increment/decrement. The serial optimization of eliminating redundant increment/decrement operations becomes more important with atomic reference counts.

Weighted reference counting can be used to reduce costs if the pointers are unshared but the referent is shared. Associate a weight with each pointer. The reference count is the sum of the weights. A pointer x can be copied as a pointer x' without updating the reference count by splitting the original weight between x and x'. If the weight of x is too low to split, then first add a constant W to the reference count and weight of x.

References
D. Bacon and V.T. Rajan, “Concurrent Cycle Collection in Reference Counted Systems” in Proc. European Conf. on Object-Oriented Programming (June 2001). Describes a garbage collector based on reference counting that does collect cycles.
M. Michael, “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects” in IEEE Transactions on Parallel and Distributed Systems (June 2004). Describes the “hazard pointer” technique.
M. Herlihy, V. Luchangco, and M. Moir, “The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized, Lock-Free Data Structures” in Proceedings of the 16th International Symposium on Distributed Computing (Oct. 2002). Describes the “pass the buck” technique.

14 Compare and Swap Loop

Problem
Atomically update a scalar value so that a predicate is satisfied.

Context
Often a shared variable must be updated atomically, by a transform that maps its old value to a new value. The transform might be a transition of a finite state machine, or recording global knowledge. For instance, the shared variable might be recording the maximum value that any thread has seen so far.

Forces
• The variable is read and updated by multiple threads.
• The hardware implements “compare and swap” for a variable of that type.
• Protecting the update with a mutex is to be avoided.

Related
Reduction
Reference Counting

Solution
The solution is to atomically snapshot the current value, and then use atomic<T>::compare_and_swap to update it. Retry until the compare_and_swap succeeds. In some cases it may be possible to exit before the compare_and_swap succeeds because the current value meets some condition.

The template below does the update x=F(x) atomically.

    // Atomically perform x=F(x).
    template<typename F, typename T>
    void AtomicUpdate( atomic<T>& x, F f ) {
        T o;
        do {
            // Take a snapshot
            o = x;
            // Attempt to install new value computed from snapshot
        } while( x.compare_and_swap(f(o),o)!=o );
    }

It is critical to take a snapshot and use it for intermediate calculations, because the value of x may be changed by other threads in the meantime. Note that compare_and_swap takes the new value first and the expected old value second, and returns the value that was in x before the operation.

The following code shows how the template might be used to maintain a global maximum of any value seen by RecordMax.
    // Atomically perform UpperBound = max(UpperBound,y)
    void RecordMax( int y ) {
        extern atomic<int> UpperBound;
        AtomicUpdate(UpperBound, [&](int value){return std::max(value,y);} );
    }

When y is not going to increase UpperBound, the call to AtomicUpdate will waste time doing the redundant operation compare_and_swap(o,o). In general, this kind of redundancy can be eliminated by making the loop in AtomicUpdate exit early if F(o)==o. In this particular case where F==std::max, that test can be further simplified. The following custom version of RecordMax has the simplified test.

    // Atomically perform UpperBound = max(UpperBound,y)
    void RecordMax( int y ) {
        extern atomic<int> UpperBound;
        int o;
        do {
            // Take a snapshot
            o = UpperBound;
            // Quit if snapshot meets condition.
            if( o>=y ) break;
            // Attempt to install new value.
        } while( UpperBound.compare_and_swap(y,o)!=o );
    }

Because all participating threads modify a common location, the performance of a compare and swap loop can be poor under high contention. Thus the applicability of more efficient patterns should be considered first. In particular:
• If the overall purpose is a reduction, use the Reduction pattern instead.
• If the update is addition or subtraction, use atomic<T>::fetch_and_add. If the update is addition or subtraction by one, use atomic<T>::operator++ or atomic<T>::operator--. These methods typically employ direct hardware support that avoids a compare and swap loop.

CAUTION: If you use compare_and_swap to update links in a linked structure, be sure you understand whether the “ABA problem” is an issue. See the Internet for discourses on the subject.

General References
This section lists general references. References specific to a pattern are listed at the end of the chapter for the pattern.
• E. Gamma, R. Helm, R. Johnson, J. Vlissides. Design Patterns (1995).
• Berkeley Pattern Language for Parallel Programming, http://parlab.eecs.berkeley.edu/wiki/patterns
• T. Mattson, B. Sanders, B. Massingill. Patterns for Parallel Programming (2005).
• ParaPLoP 2009, http://www.upcrc.illinois.edu/workshops/paraplop09/program.html
• ParaPLoP 2010, http://www.upcrc.illinois.edu/workshops/paraplop10/program.html
• Eun-Gyu Kim and Marc Snir, “Parallel Programming Patterns”, http://www.cs.illinois.edu/homes/snir/PPP/index.html

Intel® Parallel Studio 2011 SP1 Installation Guide and Release Notes
Document number: 321604-003US
24 July 2011

Table of Contents
1 Introduction
1.1 Change History
1.2 Product Contents
1.3 What’s New
1.4 System Requirements
1.5 Documentation
1.6 Samples
1.7 Technical Support
2 Installation
2.1 Pre-Installation Steps
2.1.1 Configure Microsoft Visual Studio for 64-bit Applications
2.1.2 Installation on Microsoft Windows Vista* or Windows 7*
2.2 Installation
2.2.1 Activation of Purchase after Evaluation Using the Intel Activation Tool
2.3 Installation Folders
2.4 Installation Known Issues
2.4.1 Installation Path Too Long or Filename Too Long
2.4.2 Additional Steps to Install Documentation for Microsoft Visual Studio 2010
2.4.3 Error Message "HelpLibAgent.exe has stopped working" When Uninstalling Intel Parallel Studio 2011
2.4.4 Unicode Characters in License Path
2.4.5 Documentation Issue with Multiple Visual Studio Versions
3 Disclaimer and Legal Information

1 Introduction
This document describes system requirements and how to install Intel® Parallel Studio 2011 SP1. Additional release notes for each component, with details of changes and additional technical information, can be found after installation, in the respective components’ Documentation folder.
First-time users should read “Intel® Parallel Studio Getting Started” by clicking on the “Getting Started” link at the lower left of the install window, or read the Intel® Parallel Studio Getting Started Tutorial that is available after installation at Start > All Programs > Intel Parallel Studio 2011 > Getting Started > Parallel Studio Getting Started Tutorial.

1.1 Change History
This section highlights important changes in product updates.

Update 2
• Intel® Parallel Composer 2011 Update 3
• Intel® Parallel Amplifier 2011 Update 2
• Intel® Parallel Advisor 2011 Update 2
• Intel® Parallel Inspector 2011 Update 2
• Corrections to reported problems

Update 1
• Intel® Parallel Composer 2011 Update 1
• Intel® Parallel Amplifier 2011 Update 1
• Intel® Parallel Advisor 2011 Update 1
• Intel® Parallel Inspector 2011 Update 1
• Corrections to reported problems

Product Release
• Initial product release

1.2 Product Contents
Intel® Parallel Studio 2011 SP1 includes the following components:
• Intel® Parallel Composer 2011 Update 6
• Intel® Parallel Inspector 2011 Update 3
• Intel® Parallel Amplifier 2011 Update 3
• Intel® Parallel Advisor 2011 Update 3
• Integration into Microsoft* development environments
• Sample programs
• On-disk documentation

1.3 What’s New
For details on what is new in the product components, please see the individual components’ release notes.

1.4 System Requirements
For an explanation of architecture names, see http://software.intel.com/en-us/articles/intelarchitecture-platform-terminology/
• A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium 4 processor or later, or compatible non-Intel processor)
  o Incompatible or proprietary instructions in non-Intel processors may cause the analysis capabilities of this product to function incorrectly. Any attempt to analyze code not supported by Intel® processors may lead to failures in this product.
  o For the best experience, a multi-core or multi-processor system is recommended
• 2GB RAM
• 4GB free disk space for all product features and architectures
• Microsoft Windows XP*, Microsoft Windows Vista*, Microsoft Windows 7* (32-bit or “x64” editions), or Microsoft Windows HPC Server 2008* (“x64” edition only). Embedded editions of any of these operating systems are not supported.
• When installed on Microsoft Windows Server 2008, one of:
  o Microsoft Visual Studio 2010* with C++ and “x64 Compiler and Tools” components installed [1]
  o Microsoft Visual Studio 2008* Standard Edition (or higher edition) SP1 with C++ and “x64 Compiler and Tools” components installed [1]
• When installed on Microsoft Windows XP, Windows Vista or Windows Server 2003, one of:
  o Microsoft Visual Studio 2010* with C++ and “x64 Compiler and Tools” components installed [1]
  o Microsoft Visual Studio 2008* Standard Edition (or higher edition) with C++ and “x64 Compiler and Tools” components installed [1]
  o Microsoft Visual Studio 2005* Standard Edition (or higher edition) with C++ and “x64 Compiler and Tools” components installed [1]
• Application coding requirements:
  o Programming Language: C or C++ (native, not managed code) [4]
  o Threading methodologies supported by the analysis tools:
    - Intel® Cilk™ Plus
    - Intel® Threading Building Blocks
    - Win32* Threads
    - OpenMP* [4]
• To read the on-disk documentation, Adobe Reader* 7.0 or later

Notes:
1. Microsoft Visual Studio 2005 and 2008 Standard Edition install the “x64 Compiler and Tools” component by default; the Professional and higher editions require a “Custom” install to select this. Microsoft Visual Studio 2010 includes x64 support by default.
2. The default for the Intel® compilers is to build IA-32 architecture applications that require a processor supporting the Intel® SSE2 instructions, for example, the Intel® Pentium® 4 processor. A compiler option is available to generate code that will run on any IA-32 architecture processor. However, if your application uses Intel® Integrated Performance Primitives or Intel® Threading Building Blocks, executing the application will require a processor supporting the Intel® SSE2 instructions.
3. Applications built with Intel® Parallel Composer can be run on the same Windows versions as specified above for development. Applications may also run on non-embedded 32-bit versions of Microsoft Windows earlier than Windows XP, though Intel does not test these for compatibility. Your application may depend on a Win32 API routine not present in older versions of Windows. You are responsible for testing application compatibility. You may need to copy certain run-time DLLs onto the target system to run your application.
4. The analysis tools support analysis of applications built with Intel® Parallel Composer, Intel® C++ Compiler version 10.0 or higher, and/or Microsoft Visual C++ 2005, 2008 or 2010. Applications that use OpenMP and are built with the Microsoft compiler must link to the OpenMP “compatibility library” as supplied by an Intel compiler.

1.5 Documentation
Product documentation for each component of Intel® Parallel Studio SP1 can be found in the component’s folder. In addition, “Getting Started” documentation can be found in the Documentation folder under Parallel Studio 2011.

Optimization Notice
Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors.
In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.
Notice revision #20110307

1.6 Samples
A series of samples to help introduce you to Intel® Parallel Studio 2011 SP1 can be found in the Samples folder. The samples are provided as a ZIP archive which should be unpacked to a writable folder of your choice. Each component has additional samples under its respective folder.

1.7 Technical Support
If you did not register your product during installation, please do so at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates and upgrades for the duration of the support term.

For information about how to find Technical Support, Product Updates, User Forums, FAQs, tips and tricks, and other support information, please visit http://www.intel.com/software/products/support

Note: If your distributor provides technical support for this product, please contact them for support rather than Intel.

2 Installation

2.1 Pre-Installation Steps

2.1.1 Configure Microsoft Visual Studio for 64-bit Applications
If you are using Microsoft Visual Studio 2005* or 2008 and will be developing 64-bit applications (for the Intel® 64 architecture) you may need to change the configuration of Visual Studio to add 64-bit support.

If you are using Visual Studio 2005/2008 Standard Edition, or Visual Studio 2010 Professional Edition or higher, no configuration is needed to build Intel® 64 architecture applications. For other editions:
1. From Control Panel > Add or Remove Programs, select “Microsoft Visual Studio 2005” (or 2008) > Change/Remove. The Visual Studio Maintenance Mode window will appear. Click Next.
2. Click Add or Remove Features.
3. Under “Select features to install”, expand Language Tools > Visual C++.
4. If the box “X64 Compiler and Tools” is not checked, check it, then click Update. If the box is already checked, click Cancel.
2.1.2 Installation on Microsoft Windows Vista* or Windows 7*
On Microsoft Windows Vista or Windows 7, Microsoft Visual Studio 2005 users should install Visual Studio 2005 Service Pack 1 (VS 2005 SP1) as well as the Visual Studio 2005 Service Pack 1 Update for Windows Vista, which is linked to from the VS 2005 SP1 page. After installing these updates, you must ensure that Visual Studio runs with Administrator permissions, otherwise you will be unable to use the Intel compiler. For more information, please see Microsoft's Visual Studio on Windows Vista page (http://msdn2.microsoft.com/enus/vstudio/aa948853.aspx) and related documents.

2.2 Installation
The installation of the product requires a valid license file or serial number. If you are evaluating the product, you can also choose the “Evaluate this product (no serial number required)” option during installation.

To begin installation, insert the first product DVD in your computer’s DVD-ROM drive; the installation should start automatically. If it does not, open the top-level folder of the DVD-ROM drive in Windows Explorer and double-click on setup.exe.

If you received your product as a downloadable file, double-click on the executable file (.EXE) to begin installation.

You do not need to uninstall previous versions or updates before installing a newer version; the new version will replace the older version.

2.2.1 Activation of Purchase after Evaluation Using the Intel Activation Tool
Note for evaluation customers: a new tool, the Intel Activation Tool (“ActivationTool.exe”), is included in this product release and installed at “[Common Files]\Intel\Parallel Studio 2011\Activation\”.

If you installed the product using an Evaluation license or serial number, or using the “Evaluate this product (no serial number required)” option during installation, and then purchased the product, you can activate your purchase using the Intel Activation Tool at Start > All Programs > Intel Parallel Studio 2011 > Product Activation. It will convert your evaluation software to a fully licensed product.

2.3 Installation Folders
The product installs into a folder arrangement as shown below. Not all folders will be present in a given installation. If other Intel® Parallel Studio tools are installed, they will share the top-level installation folder.
• C:\Program Files\Intel\Parallel Studio 2011\
  o Documentation
  o Samples
  o Advisor
  o Amplifier
  o Composer SP1
  o Inspector

If you are installing on a system with a non-English language version of Windows, the name of the Program Files folder may be different. On Intel® 64 architecture systems, the folder name is Program Files (X86) or the equivalent.

2.4 Installation Known Issues

2.4.1 Installation Path Too Long or Filename Too Long
During installation, if the length of the full installation path of any installed file, including the filename, exceeds 256 characters, the installation will stop with an error message. One possible error message is:

    Error 1304. Error writing to file: d:\Program Files\Development Tools\Intel\Parallel Studio 2011\Composer\Documentation\en_US\ipp\ipp_manual\IPPI\ippi_ch16\functn_YCrCb411ToYCbCr422_EdgeDV_YCrCb411ToYCbCr422_ZoomOut2_EdgeDV_YCrCb411ToYCbCr422_ZoomOut4_EdgeDV_YCrCb411ToYCbCr422_ZoomOut8_EdgeDV.htm

This can occur because the user has specified a long custom installation root directory. Try shortening this path if you run into this error. Note that this may require reinstallation of other Parallel Studio 2011 SP1 products.

2.4.2 Additional Steps to Install Documentation for Microsoft Visual Studio 2010
When installing Intel Parallel Studio 2011 SP1 on a system with Microsoft Visual Studio 2010 for the first time, you will be asked to initialize the “Local Store” for documentation for Visual Studio 2010 if it was not done before.
The "Help Library Manager" will register the Intel Parallel Studio 2011 SP1 help documentation within Visual Studio 2010. Please follow the instructions of the "Help Library Manager" installation wizard to install the Intel Parallel Studio 2011 SP1 help documentation for Visual Studio 2010.

This step is only needed once. When you install Intel Parallel Studio updates in the future, you will not be required to re-register the documentation through the “Help Library Manager”.

2.4.3 Error Message "HelpLibAgent.exe has stopped working" When Uninstalling Intel Parallel Studio 2011
When installing or uninstalling Intel Parallel Studio 2011 SP1 on a system with Visual Studio 2010, you may see the error message “HelpLibAgent.exe has stopped working”. This error does not prevent the installation or uninstallation of Intel Parallel Studio. It is an issue from a 3rd party tool. When there is a fix, the Release Notes will be updated. Please visit http://software.intel.com/en-us/articles/installation-error-helplibagentexe-has-stopped-workingwhen-uninstalling-intel-parallel-studio-2011/ for the latest update on this issue.

2.4.4 Unicode Characters in License Path
During installation, Intel® software cannot handle Unicode characters in license paths and the names of the licenses. Intel® software tries to find licenses in the standard location (%CommonProgramFiles%\Intel\Licenses, most commonly C:\Program Files\Common Files\Intel\Licenses on 32-bit systems and C:\Program Files (x86)\Common Files\Intel\Licenses on 64-bit systems).

Do not place licenses in folders or paths containing localized characters. For example: C:\????\?????. Do not rename licenses obtained from Intel using localized characters. For example ???????.lic. Do not set the INTEL_LICENSE_FILE environment variable to contain directory paths and license names containing localized characters.
Keep licenses either in the standard location (see above), or use ASCII characters in directory names and license names. For example: C:\Intel\Licenses and License.lic.

2.4.5 Documentation Issue with Multiple Visual Studio Versions
If you have both Microsoft Visual Studio* 2005 and 2008 installed on your system and integrate Intel® Parallel Studio 2011 SP1 into both versions, removing the integration from one of the versions will remove the integrated Intel® Parallel Studio documentation from both. To re-install the documentation:

For Intel® Parallel Composer 2011:
1. Use the Control Panel to select the product.
   • For Windows XP* users: Select Control Panel > Add/Remove Programs.
   • For Windows 7* users: Select Control Panel > Programs and Features.
   • For Windows Vista* users: Select Control Panel > Programs.
2. With the product selected, click the Change/Remove button and choose Modify mode.
3. In the Select Components dialog box, unselect “Integrated Documentation”; this will remove the documentation.
4. Repeat steps 1 and 2.
5. In the Select Components dialog box, select “Integrated Documentation” to install the documentation again.

For Intel® Parallel Advisor 2011, Intel® Parallel Amplifier 2011, and Intel® Parallel Inspector 2011:

First option:
1. Open the Intel® Parallel Studio command prompt (Start Menu\Programs\Intel Parallel Studio 2011\Command Prompt). You can choose any shortcut here, for example, “IA-32 Visual Studio 2005 mode”.
2. Remove the integration for the Visual Studio version that is missing integrated help. For example:
   • “ampl-vsreg -d 2005” for removing the Amplifier integration with VS2005
   • “insp-vsreg -d 2008” for removing the Inspector integration with VS2008
   • “advi-vsreg -d 2005” for removing the Advisor integration with VS2005
3. Restore the integration. For example:
   • “ampl-vsreg -i 2005” for adding the Amplifier integration with VS2005
   • “insp-vsreg -i 2008” for adding the Inspector integration with VS2008
   • “advi-vsreg -i 2005” for adding the Advisor integration with VS2005

Second option:
1. Uninstall the product.
2. Install it again with the desired Visual Studio integration selected.

3 Disclaimer and Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Celeron, Centrino, Intel, Intel logo, Intel386, Intel486, Intel Atom, Intel Core, Itanium, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

Copyright © 2011 Intel Corporation. All Rights Reserved.

Intel® Math Kernel Library Vector Statistical Library Notes

Document Number: 310714-023US

Copyright © 2003–2011, Intel Corporation. All Rights Reserved

Contents

1 Legal Information
2 Revision History
3 About This Library
4 About This Document
4.1 Conventions
5 Introduction
6 Randomness and Scientific Experiment
7 Random Numbers
8 Figures of Merit for Random Number Generators
8.1 Uniform Probability Distribution and Basic Pseudo- and Quasi-Random Number Generators
8.2 Figures of Merit for General (Non-Uniform) Distribution Generators
9 VSL Structure
9.1 Why Vector Type Generators?
9.2 Basic Generators
9.3 Random Streams and RNGs in Parallel Computation
9.3.1 Initializing Basic Generator
9.3.2 Creating and Initializing Random Streams
9.3.3 Creating Random Stream Copy and Copying Stream State
9.3.4 Saving and Restoring Random Streams
9.3.5 Independent Streams. Leapfrogging and Block-Splitting
9.3.6 Abstract Basic Random Number Generators. Abstract Streams
9.4 Generating Methods for Random Numbers of Non-Uniform Distribution
9.4.1 Inverse Transformation
9.4.2 Acceptance/Rejection
9.4.3 Mixture of Distributions
9.4.4 Special Properties
9.5 Accurate and Fast Modes of Random Number Generation
9.6 Example of VSL Use
10 Testing of Basic Random Number Generators
10.1 BRNG Implementations and Categories
10.1.1 First Category
10.1.2 Second Category
10.1.3 Third Category
10.2 Interpreting Test Results
10.2.1 One-Level (Threshold) Testing
10.2.2 Two-Level Testing
10.3 BRNG Tests Description
10.3.1 3D Spheres Test
10.3.2 Birthday Spacing Test
10.3.3 Bitstream Test
10.3.4 Rank of 31x31 Binary Matrices Test
10.3.5 Rank of 32x32 Binary Matrices Test
10.3.6 Rank of 6x8 Binary Matrices Test
10.3.7 Count-the-1's Test (Stream of Bits)
10.3.8 Count-the-1's Test (Stream of Specific Bytes)
10.3.9 Craps Test
10.3.10 Parking Lot Test
10.3.11 2D Self-Avoiding Random Walk Test
10.3.12 Template Test
10.4 BRNG Properties and Testing Results
10.4.1 MCG31m1
10.4.2 R250
10.4.3 MRG32k3a
10.4.4 MCG59
10.4.5 WH
10.4.6 MT19937
10.4.7 SFMT19937
10.4.8 MT2203
10.4.9 SOBOL
10.4.10 NIEDERREITER
11 Testing of Distribution Random Number Generators
11.1 Interpreting Test Results
11.2 Description of Distribution Generator Tests
11.2.1 Confidence Test
11.2.2 Distribution Moments Test
11.2.3 Chi-Squared Goodness-of-Fit Test
11.2.4 Performance
11.3 Continuous Distribution Functions
11.3.1 Uniform (VSL_RNG_METHOD_UNIFORM_STD/VSL_RNG_METHOD_UNIFORM_STD_ACCURATE)
11.3.2 Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER)
11.3.3 Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2)
11.3.4 Gaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF)
11.3.5 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER)
11.3.6 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER2)
11.3.7 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_ICDF)
11.3.8 Exponential (VSL_RNG_METHOD_EXPONENTIAL_ICDF/VSL_RNG_METHOD_EXPONENTIAL_ICDF_ACCURATE)
11.3.9 Laplace (VSL_RNG_METHOD_LAPLACE_ICDF)
11.3.10 Weibull (VSL_RNG_METHOD_WEIBULL_ICDF/VSL_RNG_METHOD_WEIBULL_ICDF_ACCURATE)
11.3.11 Cauchy (VSL_RNG_METHOD_CAUCHY_ICDF)
11.3.12 Rayleigh (VSL_RNG_METHOD_RAYLEIGH_ICDF/VSL_RNG_METHOD_RAYLEIGH_ICDF_ACCURATE)
11.3.13 Lognormal (VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2/VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2_ACCURATE)
11.3.14 Gumbel (VSL_RNG_METHOD_GUMBEL_ICDF)
11.3.15 Gamma (VSL_RNG_METHOD_GAMMA_GNORM/VSL_RNG_METHOD_GAMMA_GNORM_ACCURATE)
11.3.16 Beta (VSL_RNG_METHOD_BETA_CJA/VSL_RNG_METHOD_BETA_CJA_ACCURATE)
11.4 Discrete Distribution Functions
11.4.1 Uniform (VSL_RNG_METHOD_UNIFORM_STD)
11.4.2 UniformBits (VSL_RNG_METHOD_UNIFORMBITS_STD)
11.4.3 UniformBits32 (VSL_RNG_METHOD_UNIFORMBITS32_STD)
11.4.4 UniformBits64 (VSL_RNG_METHOD_UNIFORMBITS64_STD)
11.4.5 Bernoulli (VSL_RNG_METHOD_BERNOULLI_ICDF)
11.4.6 Geometric (VSL_RNG_METHOD_GEOMETRIC_ICDF)
11.4.7 Binomial (VSL_RNG_METHOD_BINOMIAL_BTPE)
11.4.8 Hypergeometric (VSL_RNG_METHOD_HYPERGEOMETRIC_H2PE)
11.4.9 Poisson (VSL_RNG_METHOD_POISSON_PTPE)
11.4.10 Poisson (VSL_RNG_METHOD_POISSON_POISNORM)
11.4.11 PoissonV (VSL_RNG_METHOD_POISSONV_POISNORM)
11.4.12 NegBinomial (VSL_RNG_METHOD_NEGBINOMIAL_NBAR)
Bibliography

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

Copyright © 2003-2011, Intel Corporation. All rights reserved.

Revision History

Revision Number  Revision Date  Description
1.0              02/03          Original version of the VSL Notes. Documents Intel® Math Kernel Library release 6.0 Gold.
2.0              03/03          Documents Intel® Math Kernel Library release 6.0 Gold + minor corrections
3.0              07/03          Documents Intel MKL release 6.1 Gold.
4.0              11/03          Documents Intel MKL release 7.0 Beta.
5.0              04/04          Documents Intel MKL release 7.0 Gold.
6.0              07/04          Documents Intel MKL release 7.0.1.
7.0              03/05          Documents Intel MKL release 8.0 Beta.
8.0              08/05          Documents Intel MKL release 8.0 Gold.
-009             03/06          Documents Intel MKL release 8.1 Gold.
-010             05/06          Documents Intel MKL release 9.0 Beta.
-011             09/06          Documents Intel MKL release 9.0 Gold.
-012             01/07          Documents Intel MKL release 9.1 Beta.
-013             03/07          Documents Intel MKL release 9.1 Gold.
-014             07/07          Documents Intel MKL release 10.0 Beta.
-015             09/07          Documents Intel MKL release 10.0 Gold.
-016             04/08          Documents Intel MKL release 10.1 Beta.
-017             08/08          Documents Intel MKL release 10.1 Gold.
-018             01/09          Documents Intel MKL release 10.2 Beta.
-019             04/09          Documents Intel MKL release 10.2.
-020             08/10          Documents Intel MKL release 10.3.
-021             02/11          Documents Intel MKL release 10.3.3.
-022             07/11          Documents Intel MKL release 10.3.5.
-023             10/11          Documents Intel MKL release 10.3.7.

1 About This Library

Vector Statistical Library (VSL) is designed for the purpose of pseudorandom and quasi-random vector generation and for convolution and correlation mathematical operations. VSL is an integral part of Intel® Math Kernel Library (Intel® MKL). VSL provides a number of generator subroutines implementing commonly used continuous and discrete distributions, all of which are based on the highly optimized Basic Random Number Generators (BRNGs) and VML, the library of vector transcendental functions, to help improve their performance.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

2 About This Document

This document includes a brief conceptual overview of random number generation problems, the product and its capabilities, with a focus on interpretation of results and the related generator figures of merit, as well as task-oriented, procedural, and reference information. In contrast to the Intel MKL Reference Manual, the VSL Notes substantially expand on the concept of random number generation and its application as well as on the related notions and issues. The document provides extensive comparative analysis of the library generators and describes the basic tests applied.
Apart from the VSL distribution generators and service subroutines, dealt with in the Intel MKL Reference Manual, the VSL Notes also describe testing of distribution generators.

Those interested in general issues related to random number generators, their quality and applications in computer simulation should refer to the Randomness and Scientific Experiment, Random Numbers, and Figures of Merit for Random Number Generators sections, which briefly cover the relevant matters and provide references for further studies.

The VSL Structure section covers the concept underlying VSL, the library structure and potential for functionality enhancement. VSL is a library of high-performance random number generators. The section describes the factors that optimize the VSL generators for Intel® processors. Special attention is given to VSL ease of use and other advantages in parallel programming.

The Testing of Basic Random Number Generators and Testing of Distribution Random Number Generators sections describe a number of tests for the VSL generators of various probability distributions. See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for the latest test results.

2.1 Conventions

The following mathematical notation is used throughout the document:
Bitwise exclusive OR.
& Bitwise AND.
| Bitwise OR.

3 Introduction

This document does not purport to cover the fundamentals of mathematical statistics and probability theory, nor those of the theory of numbers and statistical simulation. The books and articles listed in the Bibliography section cover most of these issues. What you will find below is a brief overview of issues pertaining to random number generation, interpretation of the results and the related notion of quality random number generation.
To some extent, it is an attempt to justify 'the fall' of many people engaged in solving problems of randomness simulation, that is, the fall John von Neumann meant when he wrote: "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin". Still, more and more researchers in a variety of scientific fields are becoming involved in this kind of simulation depravity, as simulation is becoming more and more valuable in various scientific disciplines. Computer simulation has become a new and de facto commonly recognized approach to scientific research along with conventional experimentation. The latter harshly restricts a mathematical model, which can only be as sophisticated as the available conventional research methods permit. As for computer simulation, with ever-growing computing power the degree of mathematical model complexity has come to depend almost exclusively on our own understanding of the phenomena we try to model. This is arguably the key factor in the great success that computer simulation has achieved in recent years.

4 Randomness and Scientific Experiment

A precise definition of what the word 'random' means can hardly be given, even considering the fact that everyday life provides a variety of examples of 'randomness'. Randomness is closely related to the unpredictability of observation results and the impossibility of predicting them with sufficient accuracy. The nature of randomness is based on the lack of exhaustive information about the phenomenon under observation. As soon as we learn the origin of that phenomenon, we no longer consider it accidental or random. On the other hand, a random phenomenon, whose origin has been revealed, loses nothing of its random character. We may characterize randomness as a type of relation stipulated by conditions that are inessential, superfluous, and extraneous to this particular phenomenon.
Thus, knowledge is incomplete by definition as it is impossible to allow for all sorts of immaterial relations. Since our knowledge is incomplete (and it is something that can hardly be helped), the observation results may prove impossible to predict with great accuracy. For instance, the initial state of the objects under observation may change imperceptibly for our instruments, but these small changes may cause significant alterations in the final results. The sophisticated nature of the observed phenomenon may make accurate computation impossible in practice, if not in theory. Finally, even minor uncontrollable disturbing factors may cause serious deviations from the hypothetical "true value".

Nevertheless, for all the likelihood of "irregularities" and "deviations", observational or experimental results still reveal a certain typical regularity, named statistical stability. Various forms of statistical stability are formulated as specific rules that mathematical statistics calls laws of large numbers. In fact, it is this stability that the mathematical theory underlying the mathematical model of random phenomena is based upon. This theory is well known as the theory of probability.

5 Random Numbers

A set of distinctive features characterizes experimental observations. Many such features are of a purely quantitative nature (results of measurements, calculations, and the like) but some of them are mainly qualitative (for example, color of the object, occurrence or non-occurrence, and so on). In the latter case results may also be presented as quantitative if some appropriate conventions have been developed and applied (this may prove to be a rather tricky task to accomplish though). Thus, even when the result is a particular quality feature it can be expressed by a certain number, which, if the result is a random phenomenon, is called a random number. Numerical methods consider random numbers not only as data from experimental observations.
Since the emergence of computers, imitating large quantities of random numbers has also been of great interest in various computational areas [Knuth81]. For historical reasons, methods that utilize random numbers to perform a simulation of phenomena are called Monte Carlo methods. Monte Carlo became a tool to perform the most complex simulations in natural and social sciences, financial analysis, physics of turbulence, rarefied gas and fluid simulations, physics of high energies, chemical kinetics and combustion, radiation transport problems, and photorealistic rendering. Monte Carlo methods are intended for various numerical problems such as solving ordinary stochastic differential equations, ordinary differential equations with random entries, boundary value problems for partial differential equations, integral equations, and evaluation of high-dimensional integrals including path-dependent integrals. Monte Carlo methods also include random variables and order statistics simulation, stochastic processes, as well as random samplings and permutations.

For various reasons [Brat87], random number generation based on completely deterministic algorithms has become most common. It is obvious, however, that numbers obtained in a strictly deterministic way cannot be considered truly random as they only imitate randomness and are, in fact, pseudo-random. Ideally, pseudo-random numbers imitate 'truly' random ones so well that without knowing the method of pseudo-random number generation and judging only by the output sequence, it is impossible to distinguish it within a reasonable time from a 'truly' random sequence with more than 50% probability [L’Ecu94]. The output sequence of most pseudorandom number generators is easily predictable. This is acceptable because a number of practical applications do not require strict unpredictability. However, there are certain applications for which most existing pseudorandom generators are useless and at times simply dangerous.
Among them, for example, are applications dealing with the geometrical behavior of large random vectors. Most existing generators should never be used for cryptographic purposes.

Pseudorandom number generators imitate finite sequences of independent identically distributed (i.i.d.) random numbers. However, some numerical methods do not really require independence between random numbers in a sequence. For such methods (numerical integration and optimization, for example) the most important thing is to fill some space with numbers as close to a given distribution as possible, at the expense of independence. Such sequences do not look random at all. For historical reasons they are called quasi-random (or low discrepancy) sequences, the respective generators are called quasi-random number generators, and Monte Carlo methods dealing with quasi-random numbers are called Quasi-Monte Carlo methods.

Hereinafter, the term 'random number generator', or RNG, refers to both pseudo- and quasi-random number generators, unless we want to emphasize the fact that a generator produces precisely a pseudo- or quasi-random sequence.

6 Figures of Merit for Random Number Generators

This section discusses figures of merit for Basic Pseudo- and Quasi-Random Number Generators as well as for General (Non-Uniform) Distribution Generators.

6.1 Uniform Probability Distribution and Basic Pseudo- and Quasi-Random Number Generators

When considering a great variety of probability distributions, special emphasis should be laid upon a uniform distribution over a certain set U of large cardinality. Firstly, such a distribution is most convenient for analysis. Secondly, a random number generator of uniform distribution can always serve as a basis for an RNG of any other distribution type. That is why we use the term basic generators in reference to pseudorandom number generators of uniform distribution.

So the observed output sequence of a basic generator should ideally possess the same properties as a sequence of independent variates uniformly distributed over a set U, that is, it should be able to pass various statistical tests for uniformity and independence. A pseudorandom number generator, however, is unable to pass all sorts of statistical tests, as it is an a priori fact that the output sequence of such a generator is anything but random. In other words, a sufficiently powerful statistical test can always be created for any individual basic RNG, which the said generator will definitely fail.

The situation may not look so desperate if we consider the time required to detect 'non-randomness' in the generator. It makes sense to consider only those statistical tests that work within a 'reasonable' period of time. What time period exactly is 'reasonable'? No direct answer is possible here, as it depends on the sphere of generator application. For example, 'reasonable' time in cryptography may be measured in years of testing conducted on a powerful cluster, while it may be significantly shorter for most other applications.
Note: At present, VSL contains general-purpose random number generators that are not intended for cryptographic applications. Cryptographic RNGs are too slow for other fields; most applications there benefit from simpler (and faster) generators: linear congruential, multiple recursive, feedback-shift-register, add-with-carry, etc.

To summarize, checking the quality of basic RNGs requires a 'reasonable' set, or battery, of statistical tests. Ideally, the choice of tests depends on the types of problems the generator is intended to solve. A suitable test battery for a general-purpose RNG library is fairly hard to choose, as the tests it includes are supposed to be versatile and sufficient for many simulation tasks. The DIEHARD Battery of Tests by G. Marsaglia [Mars95] is an example of a good set of empirical tests for basic generators. Still, a specific application type may require more complete generator testing.

While duly recognizing the importance and usefulness of empirical testing, we should emphasize the significance of theoretical methods for estimating the quality of basic generators. Theoretical research serves as the basis for a better understanding of a generator’s properties: its period length, lattice structure, discrepancy, equidistribution, etc. Theoretical evaluation is the first stage, used to reject clearly bad generators. Empirical tests should then be applied only to make sure the remaining generators are of acceptable quality. Empirical testing is just as important because most results obtained by theoretical analysis refer to a basic generator used over its entire period, while in practice only a small fraction of the period is (and should be!) used.
Good behavior of k-dimensional random number vectors over the entire period provides us with greater confidence (yet not with a proof) that similarly good statistical behavior will be observed over a smaller portion of the period [L’Ecu94].

The period of a basic generator is one of the most important features characterizing its quality. For example, one of the VSL BRNGs, the multiplicative congruential generator MCG31m1, has a period length of about 2^31, while its efficiency amounts to about four processor cycles per real number on an Intel® Itanium® 2 processor. Therefore, at a processor frequency of 1 GHz, the entire period will be covered in roughly 8.6 seconds (2^31 numbers at four cycles each is about 8.6 x 10^9 cycles). Taking into consideration that good statistical behavior of the generator is observed only over a fraction of its period (B.D. Ripley [Ripley87] recommends using no more than the square root of the period length), we may assert that such a period length is unacceptable. Such generators, however, may still be useful in certain Monte Carlo applications when a relatively small quantity of random numbers is needed, mostly because of their speed, the small amount of memory needed to keep the generator state, and the efficient methods available for generating random subsequences. For example, while estimating a global solution to an integral equation through the Monte Carlo method, the same random numbers should be used for different parameters [Mikh2000]. In any case, modern computational capacities require BRNGs with a period length of at least 2^60. All the other VSL BRNGs meet this requirement.

Pseudorandom number generators are commonly recursive integer sequences in modular arithmetic, for example:

x_n = a_1 x_{n-1} + a_2 x_{n-2} + ... + a_k x_{n-k} (mod m)

Theoretical research aims at selecting values of the parameters k, a_i, m that give the output sequence good quality properties in terms of period length, lattice structure, discrepancy, equidistribution, etc.
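The recurrence and the period argument above can be made concrete with a short sketch. This is illustrative Python, not VSL code. It takes the simplest case k = 1 and, as an assumption, uses the classic "minimal standard" multiplier a = 16807 with m = 2^31 - 1, chosen only because its outputs are widely published; the text does not state the MCG31m1 multiplier. All function names are hypothetical. With the figures quoted above (about 2^31 numbers at about four cycles each on a 1 GHz processor) the arithmetic gives about 8.6 seconds for the whole period, and the square-root rule of thumb leaves only about 46 thousand usable numbers.

```python
import math

M31 = 2**31 - 1          # prime modulus of the kind used by 31-bit MCGs
A_MINSTD = 16807         # "minimal standard" multiplier (illustrative, not VSL's)

def mcg_stream(seed, a=A_MINSTD, m=M31):
    """Yield x_n = a * x_{n-1} (mod m), the k = 1 case of the recurrence."""
    x = seed
    while True:
        x = (a * x) % m
        yield x

def seconds_to_exhaust(period, cycles_per_number, clock_hz):
    """Time needed to generate `period` numbers at the given cost per number."""
    return period * cycles_per_number / clock_hz

def ripley_budget(period):
    """Rule of thumb from the text: use at most sqrt(period) numbers."""
    return math.isqrt(period)

gen = mcg_stream(seed=1)
for _ in range(10_000):
    x = next(gen)
print(x)                                    # 1043618065
print(seconds_to_exhaust(2**31, 4, 1e9))    # about 8.6 seconds
print(ripley_budget(2**31))                 # 46340
```

The value after 10,000 steps from seed 1 is the standard published check for this multiplier, which makes the sketch easy to validate before swapping in other parameters.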
If m is a prime number and the coefficients a_i are properly selected, a period length of order m^k may be obtained. Nevertheless, m is often taken as 2^p (p > 1) because reduction modulo m is then efficient. Some authors do not recommend using m in the form of a power of 2 (see, for example, D. Knuth [Knuth81], P. L’Ecuyer [L’Ecu94]) because the lower bits of the generated random numbers prove to be non-random on the whole. For most Monte Carlo applications, however, this is immaterial. Moreover, even if m is a prime number, great care should be taken when selecting random bits from the output sequence.

For the same reasons, quasi-random number generators that fill some hypercube as evenly as possible are also called Basic Random Number Generators in VSL. Quasi-random sequences filling space according to a non-uniform distribution can be generated by transforming a sequence produced by a basic quasi-random number generator. In most cases, tests designed for pseudorandom number generators cannot be used for quasi-random number generators. Special batteries of tests should be designed for basic quasi-random number generators.

6.2 Figures of Merit for General (Non-Uniform) Distribution Generators

First and foremost, the quality of a general distribution generator greatly depends on the quality of the underlying BRNG. Several basic approaches may be singled out for testing general distribution generators.

Random number distributions can be described with a number of measures: probability moments, central and absolute moments, quantiles, mode, scattering, skewness, excess (kurtosis) coefficients, etc. All the ordinary sample characteristics converge in probability to the corresponding measures of the distribution when the sample size tends to infinity [Cram46]. Commonly, the characteristics based on the distribution moments are asymptotically normal for large sample sizes.
Some classes of sample characteristics that are not based on sampling moments are also asymptotically normal, while others have quite different asymptotic behavior. In either case, when the limit probability distribution is known, it is possible to build a statistical test to check whether a particular sample characteristic agrees with the corresponding measure of the distribution. Of greatest practical value for simulation purposes are the sample mean and variance, which characterize the location and scattering of the distribution. All the VSL random number generators undergo testing for agreement between the sampling moments (mean and variance) and the theoretical values calculated for various sample sizes and distribution parameters.

Another class of valuable tests checks how well the sample distribution function agrees with the theoretical one. The most important among them are the Pearson chi-square goodness-of-fit test (for discrete and continuous distributions) and the Kolmogorov-Smirnov goodness-of-fit test (for continuous distributions). Every VSL distribution is tested with the Pearson chi-square test over various sample sizes and distribution parameters.

It may be useful to transform the sequence being tested into one of the standard distributions, for example, a uniform, normal, or multidimensional normal distribution. The transformed sequence is then tested using a set of statistical tests specific to the distribution to which the sequence was transformed.

Tests that are based on simulation are in fact real Monte Carlo applications. Their choice is quite optional and should be made in accordance with the generator’s field of application, the only requirement being that the results obtained can be verified against a theoretical value.
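The moment and goodness-of-fit checks described above can be sketched in a few lines. The code below is a minimal illustration in Python, not the procedure used to qualify VSL: it draws a uniform sample from Python's own generator, computes the standardized deviation of the sample mean from the theoretical mean 1/2 (using the theoretical variance 1/12), and computes the Pearson chi-square statistic over equal-width bins. Function names, the seed, the sample size, and the bin count are all assumptions made for the example.

```python
import math
import random

def mean_z_score(sample, mu, sigma):
    """Standardized deviation of the sample mean from the theoretical mean."""
    n = len(sample)
    return (sum(sample) / n - mu) / (sigma / math.sqrt(n))

def chi_square_statistic(observed, expected):
    """Pearson statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def uniform_bin_counts(sample, bins):
    """Count how many values of a [0, 1) sample fall into equal-width bins."""
    counts = [0] * bins
    for u in sample:
        counts[min(int(u * bins), bins - 1)] += 1
    return counts

rng = random.Random(2024)             # fixed seed so the run is repeatable
n, bins = 100_000, 10
sample = [rng.random() for _ in range(n)]

z = mean_z_score(sample, mu=0.5, sigma=math.sqrt(1 / 12))
chi2 = chi_square_statistic(uniform_bin_counts(sample, bins), [n / bins] * bins)
print(f"z-score of the mean: {z:.3f}")
print(f"chi-square statistic with {bins - 1} degrees of freedom: {chi2:.2f}")
```

For a sound generator the z-score should look like a draw from a standard normal distribution and the statistic like a draw from a chi-square distribution with bins - 1 degrees of freedom; a real test would convert both to p-values and repeat over many samples and parameter settings.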
A good example of a simulation-based test, one that is used in checking the VSL generators for quality, is the self-avoiding random walk [Ziff98].

7 VSL Structure

The VSL library of the current Intel MKL version contains a set of generators for the general probability distributions most commonly used in simulations, such as uniform, normal (Gaussian), exponential, Poisson, etc. Non-uniform distributions are generated using various transformation techniques applied to the output of a basic (either pseudo-random or quasi-random) RNG. To generate random numbers of a given probability distribution, you can either choose one of the available VSL basic generators or register your own basic random number generator. To enhance their performance, all the VSL BRNGs are highly optimized for various architectures of Intel processors. In addition, VSL provides a number of different techniques for transforming uniformly distributed random numbers into a sequence of the required distribution. All the random number generators implemented in VSL are of vector type.
Unlike scalar generators, such as the standard rand() function, whose output is a single successive random number, vector generators produce a vector of n successive random numbers of a given distribution with given parameters. VSL is a thread-safe library convenient for parallel computing on a great variety of parallel system configurations.

A random stream is a basic notion in the RNG subcomponent of VSL. The stream mechanism provides simultaneous generation of several random number sequences produced by one or more basic generators, as well as splitting of the original sequence into several subsequences by the leapfrog and block-split methods. Multiple random streams are useful not only in parallel applications but in sequential programs as well.

7.1 Why Vector Type Generators?

Due to architectural features of modern computers, vector type library subroutines often perform much more efficiently than scalar type routines. In other words, for scalar routines the call overhead is often comparable with the total time required for the computation itself. Certainly, there are subroutines where the overhead is negligible in comparison with the total computation time. However, this is not usually the case with highly optimized RNGs. To reduce the overhead, all VSL random number generator subroutines are of vector type. You are free to call a vector random number generator subroutine to generate just one random number; however, such use is hardly efficient.

On the one hand, vector type random number generators sometimes require more careful programming. The reward in this case is a substantial speedup in overall application performance. On the other hand, VSL provides a number of services to make vector programming as natural as possible. See the sections "Independent Streams. Leapfrogging and Block-Splitting" and "Abstract Basic Random Number Generators. Abstract Streams" for further discussion.
Disregarding possible programming issues, the vector type interface is quite natural for Monte Carlo methods because Monte Carlo requires a lot of random numbers rather than just one.

7.2 Basic Generators

As indicated above, the basic generators may serve to obtain random numbers of various statistical distributions. Non-uniform distribution generators strongly depend on the quality of the underlying basic generators. Besides, as we have already mentioned, at present there is no basic generator that would be fully adequate for every application. Many of the current generators are useless, or even dangerous, for a certain category of tasks. In a number of applications, quality requirements for RNGs prevail over other requirements, such as speed, memory use, etc. In other tasks, quality requirements are not that stringent, and speed or efficiency in generating random number subsequences is of higher importance. Some applications use random numbers as real numbers, while others treat them as a bit stream. Note that even if a basic generator has trouble providing true randomness in the lower bits, it is not necessarily inadequate for applications that use variates as real numbers.

All of the above arguments indicate that a library of general-purpose RNGs should provide a set of several different basic generators, both pseudo- and quasi-random. Besides, such a library should provide an option of including new basic generators that you may find preferable. VSL provides a variety of basic pseudo- and quasi-random number generators, while also allowing you to register user-defined basic generators and to use random numbers generated externally, for example, from a physical source of random numbers [Jun99]. See the section "Abstract Basic Random Number Generators. Abstract Streams" for details.

One of the important issues for computational experimentation is verification of the results.
Typically, a researcher is unable to verify the output since the solution is simply unknown. Without going into the details of verification for sophisticated simulation systems, we note that any verification process involves testing each structural element of the system. A random number generator, being one such structural element, may bring about inadequate results. Therefore, to obtain more reliable experimental results, many authors recommend using several different basic generators in a series of computational experiments. This is yet another argument for including several BRNGs of different types in a library.

VSL provides the following basic pseudorandom number generators. (The defining recurrences, parameter values, and matrix formats that accompany the entries are not legible in this copy; only the descriptive text is kept.)

• MCG31m1. A 31-bit multiplicative congruential generator.
• R250. A generalized feedback shift register generator.
• MRG32k3a. A combined multiple recursive generator with two components of order 3.
• MCG59. A 59-bit multiplicative congruential generator.
• WH. A set of 273 Wichmann-Hill combined multiplicative congruential generators (j = 1, 2, ..., 273).

Note: The variables x_n, y_n, z_n, w_n in the defining equations of the generators above denote successive members of the integer subsequences set by recursion. The variable u_n is the generator’s real output normalized to the interval (0, 1).

• MT19937. Mersenne Twister pseudorandom number generator.
• SFMT19937. SIMD-oriented Fast Mersenne Twister pseudorandom number generator. Its recurrence is defined over 128-bit integers and uses sparse 128 x 128 binary matrices whose operations are defined as follows (shift amounts and mask values are not legible in this copy):
  - left shift of a 128-bit integer, followed by an exclusive-or operation;
  - right shift of each 32-bit integer in a quadruple, followed by an and-operation with a quadruple of 32-bit masks;
  - right shift of a 128-bit integer;
  - left shift of each 32-bit integer in a quadruple.
• MT2203. A set of 6024 Mersenne Twister pseudorandom number generators (j = 1, 2, ..., 6024).

In addition, two basic quasi-random number generators are available in VSL.

• SOBOL (with Antonov-Saleev [Ant79] modification). A 32-bit Gray code-based generator producing low-discrepancy sequences for dimensions 1 ≤ s ≤ 40.

Note 1: The value c is the position of the rightmost zero bit in n-1. The generator state is an s-dimensional vector of 32-bit values. The s-dimensional vectors calculated during random stream initialization are called direction numbers. The generator output is the state vector normalized to the unit hypercube.

Note 2: Initialization parameters for SOBOL supported by VSL provide default dimensions 1 ≤ s ≤ 40. You can also pass user-defined initialization parameters into the generator and obtain quasi-random vectors of the desired dimension.

• NIEDERREITER (with Antonov-Saleev [Ant79] modification). A 32-bit Gray code-based generator producing low-discrepancy sequences for dimensions 1 ≤ s ≤ 318.

Note: Initialization parameters for NIEDERREITER supported by VSL provide default dimensions 1 ≤ s ≤ 318. You can also pass user-defined parameters into the generator and obtain quasi-random vectors of the desired dimension.

• ABSTRACT. Abstract source of random numbers. See the section "Abstract Basic Random Number Generators. Abstract Streams" for details.
Below we discuss each basic generator in more detail and provide references for further reading.

7.2.1.1 MCG31m1

32-bit linear congruential generators, which include MCG31m1 [L’Ecuyer99], are still used as default RNGs in various systems, mostly due to simplicity of implementation, speed of operation, and compatibility with earlier versions of those systems. However, their period lengths do not meet the requirements for modern basic random number generators. Nevertheless, MCG31m1 possesses good statistical properties and may be used to advantage in generating random numbers of various distribution types for relatively small samples.

7.2.1.2 R250

R250 is a generalized feedback shift register generator. Feedback shift register generators have an extensive theoretical foundation and were first considered as RNGs for cryptographic and communications applications. The generator R250 proposed in [Kirk81] is fast and simple to implement. It is common in the field of physics. However, the generator fails a number of tests, a 2D self-avoiding random walk [Ziff98] being an example.

7.2.1.3 MRG32k3a

The combined generator MRG32k3a [L’Ecu99] meets the requirements for modern RNGs: good multidimensional uniformity, fairly large period, etc. Besides, being optimized for various Intel® architectures, this generator rivals the other VSL BRNGs in speed.

7.2.1.4 MCG59

The multiplicative congruential generator MCG59 is one of the two basic generators implemented in the NAG Numerical Libraries [NAG] (see www.nag.co.uk). Since the modulus of this generator is not prime, its period length is not 2^59 but just 2^57, provided the seed is an odd number. A drawback of such generators is well known (for example, see [Knuth81], [L’Ecu94]): the lower bits of the output sequence are not random, so breaking numbers down into their bit patterns and using individual bits may cause trouble.
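The weakness of the lower bits is easy to observe directly. The sketch below is illustrative Python, not VSL code, and it rests on an assumption: it takes the multiplier 13^13 and modulus 2^59 from the published form of the NAG generator, since the text above does not state the MCG59 parameters. The hypothetical helper `low_bits_period` measures how soon the d lowest bits of the sequence start repeating for an odd seed.

```python
A59 = 13**13      # multiplier of the NAG-style generator (an assumption here)
M59 = 2**59       # power-of-two modulus

def mcg59(seed):
    """Yield x_n = A59 * x_{n-1} (mod 2^59)."""
    x = seed
    while True:
        x = (A59 * x) % M59
        yield x

def low_bits_period(d, seed=1):
    """Smallest t > 0 after which the d lowest bits of the sequence repeat."""
    mask = (1 << d) - 1
    start = seed & mask
    x, t = seed, 0
    while True:
        x = (A59 * x) % M59
        t += 1
        if x & mask == start:
            return t

for d in (1, 2, 3, 4, 8, 12):
    print(d, low_bits_period(d))
# 1 1
# 2 1
# 3 2
# 4 4
# 8 64
# 12 1024
```

Under these assumptions the two lowest bits never change and the d lowest bits repeat with period 2^(d-2); setting d = 59 gives the full period of 2^57 quoted above. Applications that scale the whole value to a real number are barely affected, while applications that extract individual low bits are.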
In addition, block-splitting of the MCG59 sequence over the entire period into 2^d similar blocks results in full coincidence of such blocks in the d lower bits (see, for instance, [Knuth81], [L’Ecu94]).

7.2.1.5 WH

WH is a set of 273 different basic generators. It is the second basic generator in the NAG libraries. The constants a_{i,j} are in the range 112 to 127, and the constants m_{i,j} are prime numbers in the range 16718909 to 16776971, which are close to 2^24. These constants have been chosen so that they give good results with the spectral test; see [Knuth81] and [MacLaren89]. The period of each Wichmann-Hill generator would be at least 2^92 if it were not for common factors between (m_{1,j} - 1), (m_{2,j} - 1), (m_{3,j} - 1), and (m_{4,j} - 1). However, each generator should still have a period of at least 2^80. Further discussion of the properties of these generators is given in [MacLaren89], which shows that the generated pseudo-random sequences are essentially independent of one another according to the spectral test.

7.2.1.6 MT19937

The Mersenne Twister pseudorandom number generator [Matsum98] is a modification of the twisted generalized feedback shift register generator proposed in [Matsum92], [Matsum94]. The properties of the algorithm (a period length of 2^19937 - 1 and 623-dimensional equidistribution up to 32-bit accuracy) make this generator applicable for simulations in various fields of science and engineering. The initialization procedure is essentially the same as described in [MT2002].

7.2.1.7 MT2203

The set of 6024 MT2203 pseudorandom number generators is an addition to the MT19937 generator intended for large-scale Monte Carlo simulations performed on distributed multiprocessor systems. Parameters of the MT2203 generators are calculated using the methodology described in [Matsum2000], which provides mutual independence of the corresponding random number sequences.
Every MT2203 generator has a period length of 2^2203 - 1 and possesses 68-dimensional equidistribution up to 32-bit accuracy. The initialization procedure is essentially the same as described in [MT2002].

7.2.1.8 SFMT19937

The SIMD-oriented Fast Mersenne Twister pseudorandom number generator [Saito08] is analogous to the MT19937 generator and makes use of Single Instruction Multiple Data (SIMD) and multi-stage pipelining CPU features. The SFMT19937 generator has a period that is a multiple of 2^19937 - 1 and a better equidistribution property than MT19937.

7.2.1.9 SOBOL

Bratley and Fox [Brat88] provide an implementation of the Sobol quasi-random number generator. The VSL implementation allows generating Sobol’s low-discrepancy sequences of length up to 2^32. This implementation also allows registration of user-defined parameters (direction numbers, or initial direction numbers and primitive polynomials) during initialization, which makes it possible to obtain quasi-random vectors of any dimension. If you do not supply user-defined parameters, the default values are used for generation of quasi-random vectors. The default dimension of quasi-random vectors can vary from 1 to 40 inclusive.

7.2.1.10 NIEDERREITER

According to the results of Bratley, Fox, and Niederreiter [Brat92], Niederreiter’s sequences have the best known theoretical asymptotic properties. The VSL implementation allows generating Niederreiter’s low-discrepancy sequences of length up to 2^32. This implementation also allows registration of user-defined parameters (irreducible polynomials or direction numbers), which makes it possible to obtain quasi-random vectors of any dimension. If you do not supply user-defined parameters, the default values are used for generation of quasi-random vectors. The default dimension of quasi-random vectors can vary from 1 to 318 inclusive.
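The Gray code-based update that both quasi-random generators rely on (one exclusive-or with a direction number per new point, selected by the rightmost zero bit of n-1) can be sketched in one dimension. The code below is illustrative Python and not the VSL implementation. As an assumption, it uses the direction numbers of the first Sobol coordinate, v_j = 2^(32-j), and starts from a zero state; whether VSL emits the initial zero point is not stated in the text. Function names are hypothetical.

```python
BITS = 32

def rightmost_zero_bit(n):
    """1-based position of the lowest zero bit of n (n = 0 gives 1)."""
    c = 1
    while n & 1:
        n >>= 1
        c += 1
    return c

def gray_code_points(count):
    """First `count` points of a one-dimensional Gray-code sequence."""
    # Direction numbers of the first Sobol coordinate: v_j = 2^(32 - j).
    v = [0] + [1 << (BITS - j) for j in range(1, BITS + 1)]
    x, points = 0, []
    for n in range(1, count + 1):
        x ^= v[rightmost_zero_bit(n - 1)]   # a single XOR per new point
        points.append(x / 2**BITS)          # normalize to the unit interval
    return points

print(gray_code_points(7))
# [0.5, 0.75, 0.25, 0.375, 0.875, 0.625, 0.125]
```

The first 2^k - 1 points land exactly on the nonzero multiples of 1/2^k, which is the even filling of space that distinguishes a low-discrepancy sequence from a pseudorandom one. Higher dimensions reuse the same loop with a different table of direction numbers per coordinate.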
VSL provides an option of registering one or more new basic generators that you see as preferable or more reliable. Use them in the same way as the BRNGs available with VSL. The registration procedure makes it easy to include a variety of user-designed generators.

7.2.1.11 ABSTRACT

Abstract basic generators are designed to allow VSL distribution generators to be used with underlying uniform random numbers that have already been generated. There are several cases when this feature might be useful:

• random numbers of the uniform distribution are generated externally [Mars95] (for example, in a physical device [Jun99]);
• you want to study the system using the same uniform random sequence but under different distribution parameters [Mikh2000]. It is unnecessary to regenerate the uniform random numbers for each set of parameters you want to investigate.

There might be other cases when abstract basic generators are useful. See the section "Abstract Basic Random Number Generators. Abstract Streams" for further reading. Because of the specific nature of abstract basic generators, the vslNewStream and vslNewStreamEx functions cannot be used to create abstract streams. The special functions vsliNewAbstractStream, vslsNewAbstractStream, and vsldNewAbstractStream are provided to initialize integer, single precision, and double precision abstract streams, respectively.

Each of the VSL basic generators consists of 4 subroutines:

• Stream Initialization Subroutine. See the section Random Streams and RNGs in Parallel Computation for details.
• Integer Output Generation Subroutine. Every generated integral value (within certain bounds) may be considered a random bit vector. For details on randomness of individual bits or bit groups, see Basic Random Generator Properties and Testing Results.
• Single Precision Floating-Point Random Number Vector Generation Subroutine. The subroutine generates a real arithmetic vector of uniform distribution over the interval [a, b].
• Double Precision Floating-Point Random Number Vector Generation Subroutine. The subroutine generates a real arithmetic vector of uniform distribution over the interval [a, b].

7.3 Random Streams and RNGs in Parallel Computation

This section describes the usage model for random streams and RNGs, including their creation, initialization, copying, saving, and restoring.

7.3.1 Initializing Basic Generator

To obtain a random number sequence from a given basic generator, you should assign initial values, or seeds. The assigning procedure is called generator initialization (the analogous C language function is srand(seed) in stdlib.h). Different types of basic generators require a different number of initial values. For example, the seed for MCG31m1 is an integer within the range from 1 to 2^31 - 2, the initial values for MRG32k3a are a set of two triples of 32-bit values, and the seed for MCG59 is an integer within the range from 1 to 2^59 - 1. In contrast to the pseudorandom number generators, quasi-random generators require the dimension parameter on input. Thus, each BRNG, including those registered by the user, requires an individual initialization function. However, requiring individual initialization functions within the library interface would limit the versatility of the routines. The basic concept of VSL is to provide an interface with a universal mechanism for generator initialization, while hiding the details of the initialization process from the user. (Nevertheless, the initialization process is clearly documented in VSL Notes for each library basic generator.) In line with this concept, VSL offers two subroutines to initialize any basic generator (see the functions of random stream creation and initialization in the Random Streams section). These initialization functions can also be used to initialize user-supplied basic generators.
One of the subroutines initializes a given basic generator using one 32-bit initial value, which is traditionally called the seed. If the generator requires more than one 32-bit seed, VSL initializes the remaining initial values on the basis of the original seed. Thus, generator R250, which requires 250 initial 32-bit values, is initialized from one 32-bit seed by the method described in [Kirk81]. The second subroutine is a generalization of the first one. It initializes a basic generator by passing an array of n 32-bit initial values. If the number of initial values n is insufficient to initialize a given basic generator, the missing initial values are set to default values. Conversely, if the number of initial values n is excessive, the redundant values are ignored. For details on the initialization procedure, see Basic Random Generator Properties and Testing Results.

When calling initialization functions, you do not need to check that the passed initial values are acceptable for a given basic generator. If the passed seeds are unacceptable, the initialization procedure replaces them with values acceptable for the given type of BRNG. See Basic Random Generator Properties and Testing Results for details on acceptable initial values. If you add a new basic generator to VSL, you should implement an appropriate initialization function that supports the above mechanism of passing initial values and, if required, the leapfrog and block-splitting techniques.

7.3.2 Creating and Initializing Random Streams

VSL assumes that at any moment during program operation you may simultaneously use several random number subsequences generated by one or more basic generators. Consider the following scenarios:

• The simulation system has several independent structural blocks of random number generation (for example, one block generates random numbers of normal distribution, another generates uniformly distributed numbers, etc.). Each of the blocks should generate an independent random number sequence, that is, each block is assigned an individual stream that generates random numbers of a given distribution.
• It is necessary to study correlation properties of the simulation output with different distribution parameters. In this case it is natural to assign an individual random number stream (subsequence) to each set of parameters. For example, see [Mikh2000].
• Each parallel process (computational node) requires an independent random number subsequence of a given distribution, that is, a random number stream.

A random stream is an abstract source of random numbers. By linking such a stream to a specific basic generator and assigning specific initial values, we predetermine the random number sequence produced by this particular stream. In VSL a universal stream state descriptor identifies every random number stream (in C language this is just a pointer to the structure). The descriptor specifies the dynamically allocated memory space that contains information on the respective basic generator and its current state, as well as some additional data necessary for the leapfrog and/or skip-ahead method.

VSL has two stream creation and initialization functions:

vslNewStream( stream, brng, seed )
vslNewStreamEx( stream, brng, n, params )

Each of these subroutines allocates memory space to store information on the basic generator brng, its current state, etc., and then calls the initialization function of the basic generator brng, which fills the fields of the generator’s current state with relevant initial values. The initial values are defined either by one 32-bit value seed (for vslNewStream) or by an array of n 32-bit initial values params (for vslNewStreamEx). The output of vslNewStream and vslNewStreamEx is the pointer to stream, that is, the stream state descriptor.
You can create any number of streams through multiple calls of the vslNewStream or vslNewStreamEx functions. For example, you can generate several thread-safe streams that are linked to the same basic generator. The generated streams are further identified by their stream state descriptors.

Although a random number stream is a source of random numbers produced by a basic generator, that is, a generator of uniform distribution, you can generate random numbers of non-uniform distribution using streams. To do this, the stream state descriptor is passed to the transformation function that generates random numbers of a given distribution. Each function uses the stream state descriptor to produce random numbers of a uniform distribution, which are further transformed into sequences of the required distribution. See the section Generating Methods for Random Numbers of Non-Uniform Distribution for details.

When a given random number stream is no longer needed, delete it by calling the vslDeleteStream function:

vslDeleteStream( stream )

This function frees the memory space related to the stream state descriptor stream. After that, the descriptor can no longer be used.

7.3.3 Creating Random Stream Copy and Copying Stream State

VSL provides an option of producing an exact copy of a generated stream by calling the vslCopyStream function:

vslCopyStream( newstream, srcstream )

A new stream newstream is created with parameters (stream descriptive information) that are exactly the same as those of the source stream srcstream at the moment of calling vslCopyStream. The stream state of newstream will be exactly the same as that of srcstream, and both streams will generate random numbers using the same basic generator.
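The effect of an exact stream copy can be illustrated without VSL. The sketch below is an analogy in Python, not the VSL API: it treats a `random.Random` object (CPython's Mersenne Twister) as a stand-in for a stream and uses its `getstate`/`setstate` methods to take a snapshot. The helper names `new_stream` and `copy_stream` are hypothetical.

```python
import random

def new_stream(seed):
    """Create an independent generator object (a stand-in for a stream)."""
    return random.Random(seed)

def copy_stream(src):
    """Return a new generator whose state equals the current state of src."""
    dst = random.Random()
    dst.setstate(src.getstate())
    return dst

stream = new_stream(777)
warm_up = [stream.random() for _ in range(1000)]   # advance the source stream

clone = copy_stream(stream)                        # snapshot taken here
from_source = [stream.random() for _ in range(5)]
from_clone = [clone.random() for _ in range(5)]
print(from_source == from_clone)                   # True
```

The copy reproduces the source from the moment of copying onward, not from the original seed, which is what makes copies useful for replaying part of a simulation under changed parameters.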
Another service function, vslCopyStreamState, copies the current state of the stream:

vslCopyStreamState( deststream, srcstream )

The streams srcstream and deststream are assumed to have been created by one of the above methods, and both streams must be related to the same basic generator. The function vslCopyStreamState copies the information about the current stream state from srcstream into deststream. Other stream-related information remains unchanged.

7.3.4 Saving and Restoring Random Streams

Typically, to get one more correct decimal digit in Monte Carlo, you need to increase the sample by a factor of 100. That makes Monte Carlo applications computationally expensive. Some of them take days or weeks while others may take several months of computations. For such applications, saving intermediate results to a file is essential so that the computation can be continued from that result if the application is terminated intentionally or abnormally. In the case of basic generators, saving intermediate results means that the BRNG state and other descriptive data, if any, should be saved to a binary file. Since the BRNG state is not directly accessible to the user, who operates with the random stream descriptor only, VSL provides routines to save/restore random stream descriptive data to and from binary files:

errstatus = vslSaveStreamF( stream, fname )
errstatus = vslLoadStreamF( &stream, fname )

The binary file name is specified by the fname parameter. In the vslSaveStreamF function a valid random stream to be written is specified by the stream input parameter. In vslLoadStreamF the stream is the output parameter that specifies a random stream that has been created on the basis of the binary file data. Each of these functions returns the error status of the operation. A nonzero value indicates an error.

7.3.5 Independent Streams.
Leapfrogging and Block-Splitting

One of the basic requirements for random number streams is their mutual independence and lack of intercorrelation. Even if you want random number samplings to be correlated, such correlation should be controllable. The independence of streams is provided through a number of methods. We discuss three of them, all supported by VSL, in greater detail.

• For each of the streams you may use the same type of generators (for example, linear congruential generators), but choose their parameters in such a way as to produce independent output random number sequences. The Mersenne Twister generator is a good example here. It has 1024 parameter sets, which ensure that the resulting subsequences are independent (see [Matsum2000] for details). Another example is the WH generator, capable of creating up to 273 random number streams. The produced sequences are independent according to the spectral test (see [Knuth81] for the spectral test details).
• Split the original sequence into k non-overlapping blocks, where k is the number of independent streams. Each of the streams generates random numbers only from the corresponding block. This method is known as block-splitting or skipping-ahead.
• Split the original sequence into k disjoint subsequences, where k is the number of independent streams, in such a way that the first stream generates the random numbers $x_1, x_{k+1}, x_{2k+1}, x_{3k+1}, \ldots$, the second stream generates the random numbers $x_2, x_{k+2}, x_{2k+2}, x_{3k+2}, \ldots$, and, finally, the kth stream generates the random numbers $x_k, x_{2k}, x_{3k}, \ldots$ This method is known as leapfrogging. Note, however, that multidimensional uniformity properties of each subsequence deteriorate seriously as k grows. The method may be recommended if k is fairly small. Karl Entacher presents data on inadequate subsequences produced by some commonly used linear congruential generators [Ent98].
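The index arithmetic behind block-splitting and leapfrogging can be sketched without VSL. The snippet below is only an illustration under stated assumptions: it uses a toy multiplicative congruential generator (multiplier 16807, modulus 2^31-1) rather than any VSL basic generator, advances the state one step at a time instead of using a fast skip-ahead, and all function names (toy_next, toy_subsequence, block_split, leapfrog) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy generator x_{n+1} = 16807 * x_n mod (2^31 - 1); NOT a VSL generator. */
#define TOY_A 16807ULL
#define TOY_M 2147483647ULL

static uint64_t toy_next(uint64_t x)
{
    return (TOY_A * x) % TOY_M;
}

/* Discard the first `skip` elements, then keep every `stride`-th element. */
void toy_subsequence(uint64_t seed, size_t skip, size_t stride,
                     uint64_t *out, size_t n)
{
    uint64_t x = seed;
    size_t i, j;

    for (i = 0; i < skip; i++)
        x = toy_next(x);                 /* elements owned by earlier streams */
    for (i = 0; i < n; i++) {
        x = toy_next(x);                 /* element kept by this stream */
        out[i] = x;
        for (j = 1; j < stride; j++)
            x = toy_next(x);             /* elements owned by other streams */
    }
}

/* Block-splitting: stream s (0-based) starts after s*nskip elements. */
void block_split(uint64_t seed, size_t s, size_t nskip,
                 uint64_t *out, size_t n)
{
    toy_subsequence(seed, s * nskip, 1, out, n);
}

/* Leapfrogging: stream s (0-based) of k takes x_{s+1}, x_{s+1+k}, ... */
void leapfrog(uint64_t seed, size_t s, size_t k,
              uint64_t *out, size_t n)
{
    toy_subsequence(seed, s, k, out, n);
}
```

With k = 3 streams, leapfrog(seed, 1, 3, out, n) yields the second, fifth, eighth, ... elements of the original sequence, while block_split(seed, 1, 4, out, 4) yields elements five through eight. A production generator would jump ahead directly rather than stepping through the skipped elements.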
VSL allows you to use any of the above methods; the leapfrog and skip-ahead (block-split) methods deserve special attention. VSL implements block-splitting through the function vslSkipAheadStream:

vslSkipAheadStream( stream, nskip )

The function changes the current state of the stream stream so that the next call of the generator produces an output subsequence that begins with the element x_nskip rather than with the current element x_0. Thus, if you wish to split the initial sequence into nstreams blocks of nskip size each, the following sequence of operations should be implemented:

Option 1

VSLStreamStatePtr stream[nstreams];
int k;
for ( k=0; ka
/* Get successive non-uniform random number */
w := Nonuniform() // get successive uniform random number from BRNG
                  // and transform it to non-uniform random number
/* Return i-th result */
r[i] := g(u,v,w)
end do

Minimization of control flow dependency is one of the valuable means to boost performance on modern processor architectures. In particular, this means that you should try to generate and process random numbers as vectors rather than as scalars:

1. Generate vector U of pairs (u, v).
2. Applying the "good candidate" criterion f(u,v)>a, form a new vector V that consists of "good" candidates only.
3. Get vector W of non-uniform random numbers w.
4. Get vector R of results g(u,v,w).

Note that steps 1-4 do not preserve the original order in which the underlying uniform random numbers are used. Consider the example below if you need to keep the original order. Suppose that one underlying uniform random number is required per non-uniform random number.
So underlying uniform random numbers are utilized as follows:

To keep the original order of underlying uniform random number utilization, yet applying the vector random number generator effectively, pack "good" candidates into one buffer while packing random numbers to be used in non-uniform transformation into another buffer:

To apply non-uniform distribution transformation, that is, to use a VSL distribution generator, for x7, x10, x17, x22, ... stored in a buffer W, you need to create an abstract stream that is associated with buffer W.

Types of Abstract Basic Random Number Generators

VSL provides three types of abstract basic random number generators intended for:
• integer-valued buffers
• single precision floating-point buffers
• double precision floating-point buffers

Corresponding abstract stream initialization subroutines are:

vsliNewAbstractStream( &stream, n, ibuf, icallback );
vslsNewAbstractStream( &stream, n, sbuf, a, b, scallback );
vsldNewAbstractStream( &stream, n, dbuf, a, b, dcallback );

Each of these routines creates a new abstract stream stream and associates it with a corresponding cyclic buffer [i,s,d]buf of length n. Data in floating-point buffers is supposed to have uniform distribution over the (a,b) interval. An obligatory parameter is a user-provided callback function [i,s,d]callback that updates the associated buffer when it no longer holds enough random numbers for the distribution generator. A user-provided callback function has the following format, where type is unsigned int, float, or double according to the buffer type:

int MyUpdateFunc( VSLStreamStatePtr stream, int* n, type buf[], int* nmin, int* nmax, int* idx )
{
    ...
    /* Update buf[] starting from index idx */
    ...
    return nupdated;
}

For Fortran-interface compatibility, all parameters are passed by reference. The function renews the buffer buf of size n starting from position idx. Note that the buffer is considered as cyclic and index idx varies from 0 to n-1.
The minimal number of buffer entries to be updated is nmin. The maximum number of buffer entries that can be updated is nmax. To minimize callback call overheads, update as many entries as possible (that is, nmax entries) if the algorithm allows it. If you utilize multiple abstract streams, creation of multiple callback functions is not required. Instead, you may have one callback function and distinguish a particular abstract stream and a particular buffer using the stream and buf parameters respectively.

The callback function should return the quantity of numbers that have actually been updated. Typically, the return value is a number between nmin and nmax. If the callback function returns 0 or a number greater than nmax, the abstract basic generator reports an error. It is allowable, however, to update fewer than nmin numbers (but more than 0). In this case, the corresponding abstract generator calls the callback function again until at least nmin numbers are updated. Of course, this is inefficient but still may be useful if nmin numbers are not yet available at the moment of the callback function call.

The respective pointers to the callback functions are defined as follows:

typedef int (*iUpdateFuncPtr)( VSLStreamStatePtr stream, int* n, unsigned int ibuf[], int* nmin, int* nmax, int* idx );
typedef int (*dUpdateFuncPtr)( VSLStreamStatePtr stream, int* n, double dbuf[], int* nmin, int* nmax, int* idx );
typedef int (*sUpdateFuncPtr)( VSLStreamStatePtr stream, int* n, float sbuf[], int* nmin, int* nmax, int* idx );

On the user level, an abstract stream looks like a usual random stream and can be used with any service and distribution generator routines. In many cases, however, more careful programming is required while using abstract streams. For instance, checking the distribution generator status to determine whether the callback function has successfully updated the buffer is a good practice in working with abstract streams.
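The cyclic-buffer contract of the callback can be sketched in isolation. The snippet below is a minimal illustration, not VSL code: StreamHandle is a hypothetical stand-in for VSLStreamStatePtr so that the sketch compiles without MKL headers, and set_source and my_update are hypothetical names. The callback copies up to nmax values from an in-memory source into the buffer, wrapping around at n, and returns the number of entries actually updated.

```c
#include <stddef.h>

/* Hypothetical stand-in for VSLStreamStatePtr (assumption for this sketch). */
typedef void *StreamHandle;

/* Hypothetical in-memory source of uniform numbers already scaled to (0,1). */
static const double *src_data = NULL;
static int src_len = 0;
static int src_pos = 0;

void set_source(const double *data, int len)
{
    src_data = data;
    src_len = len;
    src_pos = 0;
}

/* Same parameter list as the double precision callback type described above. */
int my_update(StreamHandle stream, int *n, double dbuf[],
              int *nmin, int *nmax, int *idx)
{
    int i;

    (void)stream;   /* a single stream is assumed, so the handle is unused */
    (void)nmin;     /* this sketch always tries to deliver nmax entries */

    for (i = 0; i < *nmax && src_pos < src_len; i++)
        dbuf[(*idx + i) % (*n)] = src_data[src_pos++];   /* cyclic index */

    return i;       /* 0 signals that no numbers could be supplied */
}
```

Returning 0 once the source is exhausted matches the rule above that a zero return makes the abstract generator report an error, which the caller can then detect through the distribution generator status.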
Another important note is that a buffer associated with an abstract stream must not be updated manually, that is, outside of a callback function. In particular, this means that the buffer should not be filled with numbers at the moment of abstract stream initialization with the vsl[i,s,d]NewAbstractStream function call.

The type of the abstract stream to be created should also be chosen carefully. This type depends on the particular distribution generator routine. For instance, all single precision continuous distribution generator routines utilize abstract streams associated with single precision buffers, while double precision distribution generators utilize abstract streams associated with double precision buffers. Most discrete distribution generators utilize abstract streams that are associated with either single or double precision buffers. See the following table to choose the appropriate type of an abstract stream:

Type of Discrete Distribution             Type of Abstract Stream
Uniform                                   double precision
UniformBits                               integer
Bernoulli                                 single precision
Geometric                                 single precision
Binomial                                  double precision
Hypergeometric                            double precision
Poisson (VSL_METHOD_IPOISSON_POISNORM)    single precision
Poisson (VSL_METHOD_IPOISSON_PTPE)        single and double precision
PoissonV                                  single precision
NegBinomial                               double precision

The following example demonstrates generation of random numbers of the Poisson distribution with parameter lambda = 3 using an abstract stream. Random numbers are assumed to be uniform integers from 0 to 2^31-1 and are stored in the ran_nums.txt file. In the callback function, the numbers are transformed to double precision format and normalized to the (0,1) interval.
#include <stdio.h>
#include "mkl_vsl.h"

#define METHOD VSL_METHOD_IPOISSON_PTPE
#define N 4500
#define DBUFN 1000
#define M 0x7FFFFFFF /* 2^31-1 */

static FILE* fp;

int MydUpdateFunc(VSLStreamStatePtr stream, int* n, double dbuf[], int* nmin, int* nmax, int* idx)
{
    int i;
    unsigned int num;
    double c;

    c = 1.0 / (double)M;
    for ( i = 0; i < *nmax; i++ )
    {
        /* Stop at end of file or on a malformed entry */
        if ( fscanf(fp, "%u", &num) != 1 ) break;
        /* Normalize the integer to the (0,1) interval */
        dbuf[(*idx+i) % (*n)] = (double)num * c;
    }
    return i;
}

int main()
{
    int errcode;
    double lambda, a, b;
    double dBuffer[DBUFN];
    int r[N];
    VSLStreamStatePtr stream;

    /* Boundaries of the distribution interval */
    a = 0.0;
    b = 1.0;
    /* Parameter of the Poisson distribution */
    lambda = 3.0;

    fp = fopen("ran_nums.txt", "r");

    /***** Initialize stream *****/
    vsldNewAbstractStream( &stream, DBUFN, dBuffer, a, b, MydUpdateFunc );

    /***** Call RNG *****/
    errcode = viRngPoisson(METHOD,stream,N,r,lambda);
    if (errcode == VSL_ERROR_OK)
    {
        /* Process vector of the Poisson distributed random numbers */
        ...
    }
    else
    {
        /* Process error */
        ...
    }
    ...
    vslDeleteStream( &stream );
    fclose(fp);
    return 0;
}

7.4 Generating Methods for Random Numbers of Non-Uniform Distribution

You can use a source of uniformly distributed random numbers to generate both discrete and continuous distributions, which is implemented through a number of methods briefly described below.

7.4.1 Inverse Transformation

The probability distribution of a one-dimensional variate X may be most generally presented in terms of the cumulative distribution function (CDF): $F(x) = P(X \le x)$. Any CDF is defined on the whole real axis and is monotonically increasing, where $F(-\infty) = 0$ and $F(+\infty) = 1$. In the case of continuous distribution, the cumulative distribution function F(x) is a continuous one.
In what follows, we assume that F(x) is strictly increasing; a CDF that is strictly increasing only on a limited number of intervals leads to trivial complications and generalizations of what follows. Assuming the CDF is strictly increasing, the following single-valued inverse function exists: $G(u) = F^{-1}(u)$, $0 < u < 1$. It is easy to prove that, if U is a variate with a uniform distribution on the interval (0, 1), then the variate $X = G(U)$ has the distribution F(x). Thus, the inverse transformation method can be implemented as follows:

1. Generate a uniformly distributed random number u such that 0 < u < 1.
2. Take x = G(u) as a random number of the distribution F(x).

The only drawback of this approach is that G(u) in closed form is often hard to find, while the numerical solution of the equation $F(x) = u$ for x is, as a rule, excessively time consuming.

For discrete distributions, the CDF is a step function, and the inverse transformation method is still applicable. For simplicity, let us assume that the distribution has probability mass points k = 0, 1, 2, ... with probabilities $p_k$. Then the distribution function is the sum $F(x) = \sum_{k=0}^{\lfloor x \rfloor} p_k$, where $\lfloor x \rfloor$ is the maximum integer that does not exceed x. If a continuous function G(u) exists in closed form so that $F(\lfloor G(u) \rfloor - 1) \le u < F(\lfloor G(u) \rfloor)$, and G(u) is monotone, then generation of random numbers of the distribution F(x) can be implemented as follows:

1. Generate a uniformly distributed random number u such that 0 < u < 1.
2. Take k = floor(G(u)) as a random number of the distribution F(x).

For example, for the geometric distribution $p_k = p(1-p)^k$ and $F(x) = 1 - (1-p)^{\lfloor x \rfloor + 1}$. Then G(u) does exist, as it is easy to prove: $G(u) = \ln(1-u) / \ln(1-p)$. However, for most cases finding the closed form of G(u) is too hard. An acceptable solution may be found using numerical search for k proceeding from $F(k-1) \le u < F(k)$. With tabulated values of F(k), the task is reduced to table lookup.
As F(k) is a monotonically increasing function, you may use search algorithms that are considerably more efficient than exhaustive search. The efficiency is solely dependent on the size of the table. The inverse transformation method can be applied to s-dimensional quasi-random vectors. The resulting quasi-random sequence has the required s-dimensional non-uniform distribution.

7.4.2 Acceptance/Rejection

The cumulative distribution function, let alone the inverse one, is very often much more complex computationally than the probability density function (for continuous distributions) and the probability mass function (for discrete distributions). Therefore, methods based on the use of density (mass) functions are often more efficient than the inverse transformation method. We will consider the case of a continuous probability distribution, although this technique is just as effective for discrete distributions.

Suppose we need to generate random numbers x with distribution density f(x). Apart from the variate X, let us consider the variate Y with the density g(x), which has a fast method of random number generation, and the constant c such that $f(x) \le c\,g(x)$ for all x. Then it is easy to conclude that the following algorithm generates random numbers x with the distribution F(x):

1. Generate a random number y with the distribution density g(x).
2. Generate a random number u (independent of y) that is uniformly distributed over the interval (0, 1).
3. If $u \le f(y) / (c\,g(y))$, accept y as a random number x with the distribution F(x); otherwise go back to Step 1.

The efficiency of this method greatly depends on the complexity of random number generation with distribution density g(x), the computational complexity of the functions f(x) and g(x), and the value of the constant c. The closer c is to 1, the less often the generated y is rejected.
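The three steps of the acceptance/rejection algorithm can be sketched as follows. The example is an illustration under stated assumptions, not VSL code: the target density is f(x) = 6x(1-x) on (0, 1), the proposal density is g(x) = 1 on (0, 1), and c = 1.5 is the maximum of f. The uniform numbers come from a toy multiplicative congruential generator, and the names toy_rng, toy_uniform, and sample_by_rejection are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy generator state; NOT a VSL stream. The seed must be in 1..2^31-2. */
typedef struct { uint64_t x; } toy_rng;

/* Returns a uniform number strictly inside (0, 1). */
static double toy_uniform(toy_rng *g)
{
    g->x = (16807ULL * g->x) % 2147483647ULL;
    return (double)g->x / 2147483647.0;
}

/* Target density f(x) = 6x(1-x) on (0, 1). */
static double f_target(double x)
{
    return 6.0 * x * (1.0 - x);
}

/* Returns one sample with density f; *trials (if not NULL) counts the
   number of candidate pairs (y, u) that were generated. */
double sample_by_rejection(toy_rng *g, long *trials)
{
    const double c = 1.5;            /* f(x) <= c * g(x) with g(x) = 1 */

    for (;;) {
        double y = toy_uniform(g);   /* Step 1: y with density g */
        double u = toy_uniform(g);   /* Step 2: independent uniform u */
        if (trials != NULL)
            ++*trials;
        if (u * c <= f_target(y))    /* Step 3: u <= f(y) / (c * g(y)) */
            return y;
    }
}
```

On average one sample costs c candidate pairs, so about two thirds of the candidates are accepted in this setup, which illustrates why a constant c close to 1 is desirable.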
Note: Since quasi-random sequences are non-random, great care should be taken when using quasi-random basic generators with the acceptance/rejection methods.

7.4.3 Mixture of Distributions

Sometimes it may be useful to split the initial distribution into several simpler distributions: $F(x) = \sum_i p_i F_i(x)$, so that random numbers for each of the distributions $F_i(x)$ are easy to generate. Then the appropriate algorithm may be as follows:

1. Generate a random number i with the probability $p_i$.
2. Generate a random number y (independent of i) with the distribution $F_i(x)$.
3. Accept y as a random number x with the distribution F(x).

This technique is most common in the acceptance/rejection method, when for the whole range of acceptable x values a density g(x), which would approximate the function f(x) well enough, is hard to find. In this case, the range is divided into sections so that g(x) looks relatively simple in each of the sub-ranges.

Note: Since quasi-random sequences are non-random, great care should be taken when using quasi-random basic generators with the mixture methods.

7.4.4 Special Properties

The most efficient algorithms, though based on the general methods described in the previous sections, should, nevertheless, make use of special properties of distributions, if possible. For example, the inverse transformation method is inapplicable to normal distribution directly. However, use of polar coordinates for a pair of independent normal variates makes it possible to develop an efficient method of random number generation based on 2D inverse transformation, which is known as the Box-Muller method: $x_1 = \sqrt{-2 \ln u_1}\,\sin(2\pi u_2)$, $x_2 = \sqrt{-2 \ln u_1}\,\cos(2\pi u_2)$.

Generating s-dimensional normally distributed quasi-random sequences with 2D inverse transformation (VSL name is the Box-Muller2 method), when s is odd, seems to be problematic because quasi-random numbers are generated in pairs.
One of the options is to generate (s+1)-dimensional normally distributed quasi-random numbers from (s+1)-dimensional quasi-random numbers produced by a basic quasi-random generator and then ignore the last dimension. Another option is to use the method that produces one normally distributed number from two uniform ones (VSL name is the Box-Muller method). In this case, to generate s-dimensional normally distributed quasi-random numbers, use 2s-dimensional quasi-random numbers produced by a basic quasi-random generator.

For a binomial distribution with parameters m, p, the probability mass function is found as follows: $P(X = k) = \binom{m}{k} p^k (1-p)^{m-k}$, k = 0, 1, ..., m. For p > 0.5, it is convenient to make use of the fact that $\mathrm{Binomial}(m, p) = m - \mathrm{Binomial}(m, 1-p)$.

To summarize, we note that a uniform distribution can be converted to a general distribution by a number of methods. Also, two different transformation techniques applied to one and the same uniform sequence produce two different sequences of a general distribution, though possessing the same statistical properties. Let us consider a simple example. If U1, U2 are two independent random values uniformly distributed over the interval (0, 1), that is, with the distribution function F(x) = x, 0 < x < 1, then the variate X = max(U1, U2) has the distribution F(x)·F(x). Thus, the random number x1 with the distribution of the maximum of two independent uniform variates may be derived either from a pair of uniformly distributed random numbers u1, u2 as x1 = max(u1, u2) or from one uniform random number u1 as x1 = sqrt(u1) by applying the inverse transformation method. It is obvious that applying two different methods to one and the same sequence u1, u2, u3, ... will give two absolutely different sequences xi.

Transformation into non-uniform distribution sequences may be accomplished in a variety of ways, and as a rule no single method is the fastest or the most accurate.
The inverse transformation method may be preferable over the acceptance/rejection method for some applications and architectures, while the reverse preference is common for others. Taking this into account, the VSL interface provides different options of random number generation for one and the same probability distribution. For example, Poisson distributed random numbers may be generated by two different methods: the first, known as PTPE [Schmeiser81], is based on acceptance/rejection and mixture of distributions techniques, while the second one is implemented through transformation of normally distributed random numbers. The method number selects a method for a specified generator, for example:

viRngPoisson( VSL_METHOD_IPOISSON_PTPE, stream, n, r, lambda ) - calling the PTPE method by passing the method number VSL_METHOD_IPOISSON_PTPE.
viRngPoisson( VSL_METHOD_IPOISSON_POISNORM, stream, n, r, lambda ) - calling transformation from normally distributed random numbers by passing the method number VSL_METHOD_IPOISSON_POISNORM.

For details on methods to be used for specific distributions see the Continuous Distribution Functions and Discrete Distribution Functions sections.

7.5 Accurate and Fast Modes of Random Number Generation

When using the distribution generators in an application, you can expect the obtained random numbers to belong to the definitional domain of the corresponding distribution irrespective of its parameters. For example, random numbers $r_i$ uniformly distributed on $[a, b)$, obtained as output of the relevant generator, are assumed to satisfy the condition $a \le r_i < b$ for all indices i and for all values of a and b. However, due to the specifics of floating-point calculations and rounding modes, some continuous distribution generators may produce random numbers lying beyond the definitional domain for some particular values of distribution parameters. This is unacceptable in applications for which accuracy of calculations is highly critical.
To resolve this issue, VSL defines two modes of random number generation: accurate and fast. A generation mode is selected in the call of the distribution generator by the value of the method parameter. For example, accurate generation of single precision floating-point numbers from the distribution uniform on the interval [a, b) in C looks like this:

...
status=vsRngUniform(VSL_METHOD_SUNIFORM_STD_ACCURATE, stream, n, r, a, b);
...

So, if a Monte Carlo application uses several distribution generators, each of them can be called in the preferable mode. When used in accurate mode, the generators produce random numbers that belong to the definitional domain for all parameter values of the distribution. See the table below for a list of generators supporting the accurate mode of calculations.

Type of Distribution    Data Types
Uniform                 s,d
Exponential             s,d
Weibull                 s,d
Rayleigh                s,d
Lognormal               s,d
Gamma                   s,d
Beta                    s,d

The distribution generators used in the fast mode produce numbers beyond the definitional domain in relatively rare cases. The application should set accurate mode if all generated random numbers are expected to belong to the definitional domain irrespective of distribution parameter values. Use of the accurate mode may slightly degrade the performance of random number generation.

7.6 Example of VSL Use

A typical algorithm for VSL generators is as follows:

1. Create and initialize stream/streams. Functions vslNewStream, vslNewStreamEx, vslCopyStream, vslCopyStreamState, vslLeapfrogStream, vslSkipAheadStream.
2. Call one or more RNGs.
3. Process the output.
4. Delete the stream/streams. Function vslDeleteStream.

Note: You may reiterate steps 2-3. Random number streams may be generated for different threads.

The following example demonstrates generation of two random streams. The first of them is the output of the basic generator MCG31m1 and the second one is the output of the basic generator R250. The seeds are equal to 1 for each of the streams.
The first stream is used to generate 1,000 normally distributed random numbers in blocks of 100 random numbers with parameters a = 5 and sigma = 2. The second stream is used to produce 1,000 exponentially distributed random numbers in blocks of 100 random numbers with parameters a = -3 and beta = 2. Delete the streams after completing the generation. The purpose is to calculate the sample mean for normal and exponential distributions with the given parameters.

#include <stdio.h>
#include "mkl.h"

int main(void)
{
    float rn[100], re[100]; /* buffers for random numbers */
    float sn, se; /* averages */
    VSLStreamStatePtr streamn, streame;
    int i, j;

    /* Initializing */
    sn = 0.0f;
    se = 0.0f;
    vslNewStream( &streamn, VSL_BRNG_MCG31, 1 );
    vslNewStream( &streame, VSL_BRNG_R250, 1 );

    /* Generating */
    for ( i=0; i<10; i++ )
    {
        vsRngGaussian( VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, streamn, 100, rn, 5.0f, 2.0f );
        vsRngExponential( VSL_RNG_METHOD_EXPONENTIAL_ICDF, streame, 100, re, -3.0f, 2.0f );
        for ( j=0; j<100; j++ )
        {
            sn += rn[j];
            se += re[j];
        }
    }
    sn /= 1000.0f;
    se /= 1000.0f;

    /* Deleting the streams */
    vslDeleteStream( &streamn );
    vslDeleteStream( &streame );

    /* Printing results */
    printf( "Sample mean of normal distribution = %f\n", sn );
    printf( "Sample mean of exponential distribution = %f\n", se );
    return 0;
}

When you call a generator of random numbers of normal (Gaussian) distribution, use the named constant VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2 to invoke the Box-Muller2 generation method. In the case of a generator of exponential distribution, assign the method by the named constant VSL_RNG_METHOD_EXPONENTIAL_ICDF.

The following example generates 100 three-dimensional quasi-random vectors in the (2,3)^3 hypercube using SOBOL BRNG.
#include <stdio.h>
#include "mkl.h"

int main(void)
{
    float r[100][3]; /* buffer for quasi-random numbers */
    VSLStreamStatePtr stream;

    /* Initializing */
    vslNewStream( &stream, VSL_BRNG_SOBOL, 3 );

    /* Generating */
    vsRngUniform( VSL_RNG_METHOD_UNIFORM_STD, stream, 100*3, (float*)r, 2.0f, 3.0f );

    /* Deleting the streams */
    vslDeleteStream( &stream );
    return 0;
}

8 Testing of Basic Random Number Generators

This section provides information on testing the Basic Random Number Generators (BRNG), including some details on BRNG properties and categories, as well as on interpretation of test results.

8.1 BRNG Implementations and Categories

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Three implementations are available for every basic generator:
• integer implementation (output is a 32-bit integer sequence)
• real (single precision)
• real (double precision)

You can use the basic generator integer output to obtain random bits or groups of bits. However, when you interpret the output of a generator, you should take into consideration the characteristics of each basic generator in general and its bit precision in particular. For detailed information on implementations of each basic generator, see Basic Random Generator Properties and Testing Results.
All VSL basic generators are tested by a number of specially designed empirical tests. These tests are applied either to floating-point sequences or to integer-valued sequences. The set of tests for basic generators can be divided into three categories:
• tests to analyze the randomness of bits/groups of bits
• tests to analyze the randomness of real random numbers normalized to the interval (0, 1)
• tests to analyze conformance to the template

8.1.1 First Category

You can only use the first category tests to evaluate the basic generator integer implementation. The function viRngUniformBits corresponds to the integer implementation on the interface level. The testing in this category is made with regard to characteristics of each basic generator and its bit precision in particular. You can subsequently use the results of the tests to decide if you can apply this particular basic generator to obtain random bits or groups of bits. A failed test does not mean that the generator is bad but rather that the interpretation of the integer output as the stream of random bits may result in an inadequate simulation outcome. Also, this category includes a set of tests to determine the degree of randomness of upper, medium and lower bits. For example, upper bits may prove to be much more random than lower. Thus some tests may indicate which bits or groups of bits are better for use as random ones.

8.1.2 Second Category

The second category contains different tests for basic generator normalized output. You can apply all these tests for real implementation of both single and double precision. Moreover, in most cases, the testing results are identical for both implementations, which proves that non-randomness of lower bits in the original integer sequence does not have practical influence on the randomness of the real basic generator output normalized to the (0, 1) interval.
The functions vsRngUniform and vdRngUniform, for single and double precision respectively, correspond to real implementations on the interface level. 8.1.3 Third Category The third category contains tests to check how a basic generator output conforms to the template. Template tests variations check if the leapfrog and skip-ahead methods generate subsequences of random numbers correctly. These tests are particularly important because, if any current member of the integer sequence differs from the template in a single bit only, the resulting sequence will be totally different from the template sequence. Also, the statistical properties of such sequence are worse than those of the template sequence. This assumption is based on the fact that in a variety of sequences there are a very small number of "sufficiently random" sequences. As Knuth suggests, "random numbers should not be generated with a method chosen at random" [Knuth81]. However, situations are possible, where the random choice of the method of generation is not a result of personal preference but rather the curse of a bug. 8.2 Interpreting Test Results Testing of a generator for all possible seeds and sampling sizes is hardly practicable. Therefore we actually test only a few subsequences of various lengths. Testing a random number sequence u1, u2, ..., un gives a p-value that falls within the range from 0 to 1. Being a function of a random sampling, this p-value is a random number itself. For the sequence u1, u2, ..., un of truly random numbers, the resulting p-value is supposed to be uniformly distributed over the interval (0, 1). Significant p-value deviation from the theoretical uniform distribution may indicate a defect in the tested sequence. For example, we may consider the sequence u1, u2, ..., un suspicious, if the resulting p-value falls outside the interval (0.01, 0.99). The chance to reject a 'good' sequence in this case is 2%. 
Multiple testing of different subsequences of the sequence makes the statistical conclusion about the sequence randomness more substantiated, and there are several ways to arrive at such a conclusion.

8.2.1 One-Level (Threshold) Testing

When we test K subsequences $u_1, u_2, \ldots, u_n$; $u_{n+1}, u_{n+2}, \ldots, u_{2n}$; ...; $u_{(K-1)n+1}, u_{(K-1)n+2}, \ldots, u_{Kn}$ of the original sequence, we compute p-values $p_1, p_2, \ldots, p_K$. For a subsequence $u_{(j-1)n+1}, u_{(j-1)n+2}, \ldots, u_{jn}$ the test j is failed if the value $p_j$ falls outside the interval $(p_l, p_h) \subset (0, 1)$. We consider the sequence $u_1, u_2, \ldots, u_{Kn}$ suspicious when r or more test iterations failed. We have conducted threshold testing for the VSL generators with 10 iterations (K=10), the interval $(p_l, p_h)$ equal to (0.05, 0.95), and r = 5. Each iteration then fails with probability 0.1, so the chance to reject a 'good' sequence, that is, the probability that 5 or more of 10 independent iterations fail, is 0.16349374% ≈ 0.2%.

8.2.2 Two-Level Testing

When we test K subsequences $u_1, u_2, \ldots, u_n$; $u_{n+1}, u_{n+2}, \ldots, u_{2n}$; ...; $u_{(K-1)n+1}, u_{(K-1)n+2}, \ldots, u_{Kn}$ of the original sequence, we compute p-values $p_1, p_2, \ldots, p_K$. Since the resulting p-values for the sequence $u_1, u_2, \ldots, u_{Kn}$ of truly random numbers are supposed to be uniformly distributed over the interval (0, 1), we may subject those p-values to any uniformity test, thus obtaining the p-value $q_1$ of the second level. After going through this procedure L times we obtain L p-values of the second level $q_1, q_2, \ldots, q_L$ that we subject to threshold testing. We have conducted threshold second level testing for the VSL generators with 10 iterations (L=10) and applied the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to evaluate the uniformity of $p_1, p_2, \ldots, p_K$.

8.3 BRNG Tests Description

Most of the empirical tests that are used for testing the VSL BRNGs are well documented (for example, see [Mars95], [Ziff98]).
Nevertheless, we find it useful to describe them and the testing procedure in greater detail here, since tests may vary in their applicability and implementation for a particular basic generator. We also provide the figures of merit that are used to decide between passing and failure in one- or two-level testing. For the ideas underlying such criteria, see the Interpreting Test Results section.

8.3.1 3D Spheres Test

8.3.1.1 Test Purpose

The test uses simulation to evaluate the randomness of triplets of sequential random numbers of uniform distribution. The stable response is the volume of the sphere whose radius is equal to the minimal distance between the generated 3D points.

8.3.1.2 First Level Test

The test generates the vector $u_i$ of 12,000 random numbers (i = 0, 1, ..., 11999), which are uniformly distributed in the (0, 1000) interval. The test forms 4,000 triplets of random numbers $x_k = (u_{3k}, u_{3k+1}, u_{3k+2})$ (k = 0, 1, ..., 3999) situated in the cube $R = (0, 1000) \times (0, 1000) \times (0, 1000)$. Further, the test calculates $d_{min} = \min d(x_k, x_l)$ ($l \ne k$), where d(x, y) is the Euclidean distance between x and y. In this case, the volume of the sphere with the radius $d_{min}$ should have a distribution close to the exponential one with parameters $a = 0$, $\beta = 40\pi$. Thus, the distribution of the value $p = 1 - \exp(-d_{min}^3/30)$ should be close to the uniform distribution. This p-value is the result of the first level test.

8.3.1.3 Second Level Test

The second level test performs the first level test ten times. The result of each first level test is the p-value $p_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained set of $p_j$ (j = 1, 2, ..., 10). If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.1.4 Final Result Interpretation

The test performs the second level test ten times. The final result is the percentage FAIL of failed second level tests. The acceptable result is FAIL < 50%.
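
The first level of the 3D Spheres Test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the minimal distance is found by brute force, and Python's random module stands in for a VSL basic generator. The formula for p is the one given in the text and is statistically meaningful only for 4,000 points in the cube with edge 1000.

```python
import math
import random
from itertools import combinations


def min_distance(points):
    """Smallest Euclidean distance between any two distinct points (brute force)."""
    return min(math.dist(a, b) for a, b in combinations(points, 2))


def spheres_p_value(d_min):
    """First level p-value p = 1 - exp(-d_min**3 / 30) from the text."""
    return 1.0 - math.exp(-d_min**3 / 30.0)


def first_level_test(u):
    """Group the flat sequence u into triplets and return the p-value."""
    points = [tuple(u[i:i + 3]) for i in range(0, len(u) - len(u) % 3, 3)]
    return spheres_p_value(min_distance(points))


# Tiny deterministic check of the helpers: the closest pair is 5 apart.
demo = [0.0, 0.0, 0.0, 3.0, 4.0, 0.0, 500.0, 500.0, 500.0]
p_demo = first_level_test(demo)  # equals 1 - exp(-125 / 30)


def full_size_run(seed=1):
    """Full-size run with 12,000 numbers (about 8 million pairs, slow in pure Python)."""
    rng = random.Random(seed)  # stand-in for a VSL basic generator (assumption)
    u = [rng.uniform(0.0, 1000.0) for _ in range(12000)]
    return first_level_test(u)
```

Calling full_size_run with ten different seeds would produce the ten p-values that the second level test examines for uniformity.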
8.3.1.5 Tested Generators

Function Name       Application
vsRngUniform        applicable
vdRngUniform        applicable
viRngUniform        not applicable
viRngUniformBits    applicable

Note: For the function viRngUniformBits, the test transforms the integer output into real output within the interval (0, 1). For detailed information about the normalization of the integer output, see the description of the given basic generator.

8.3.2 Birthday Spacing Test

8.3.2.1 Test Purpose

The test uses simulation to evaluate the randomness of groups of 24 sequential bits of the integer output of a basic generator. The test analyzes all possible groups of this kind, for example, bits 0 to 23, bits 1 to 24, and so on.

8.3.2.2 First Level Test

The first level test selects at random $m = 2^{10}$ "birthdays" from a "year" of $n = 2^{24}$ days. Then the test computes the spacing between the birthdays for each pair of sequential birthdays. The test then uses the spacings to determine the K value, that is, the number of pairs of sequential birthdays with the spacing of more than one day. In this case K should have a distribution close to the Poisson distribution with the parameter $\lambda = 16$. The first level test determines 200 values $K_j$ (j = 1, 2, ..., 200). To obtain the p-value p, the test applies the chi-square goodness-of-fit test to the determined values.

The integer output is interpreted differently for each basic generator:

BRNG         Integer Output Interpretation
MCG31m1      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-30. NB=31, WS=32.
R250         Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MRG32k3a     Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MCG59        Array of 64-bit integers. Each 64-bit integer uses the following bits: 0-58. NB=59, WS=64.
WH           Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-23. NB=24, WS=32.
MT19937      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MT2203       Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
SFMT19937    Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.

The test generates the dates of the birthdays in the following way:
1. Selects the bits $b_s, b_{s+1}, \ldots, b_{s+23}$ from the next WS-bit integer of the integer output of viRngUniformBits.
2. Treats the selected bits as a 24-bit integer, that is, the number of the day on which the next birthday takes place, and thus generates a birthday.

The test performs steps 1 and 2 m times to generate m birthdays, given that the year consists of n days. The legitimate values of s are different for each basic generator (see the table above): $0 \le s \le NB - 24$.

8.3.2.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained set of $p_j$ (j = 1, 2, ..., 10). If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.2.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 24$. The test computes the percentage FAIL of failed second level tests for each s. The final result is the minimal percentage of failed tests $FAIL = \min(FAIL_0, FAIL_1, \ldots, FAIL_{NB-24})$ over $0 \le s \le NB - 24$. The acceptable result is FAIL < 50%. Thus, the test determines whether it is possible to select 24 random bits from every element of the integer output of the generator.

• The integer output for the WH generator consists of quadruples of 32-bit values $(x_i, y_i, z_i, w_i)$. In each 32-bit value only the lower 24 bits are significant.
• The second level test is performed ten times for the $x_i$ element. Then the test computes the percentage $FAIL_x$ of failed second level tests.
• The second level test is performed ten times for the $y_i$ element. Then the test computes the percentage $FAIL_y$ of failed second level tests.
• The test performs the same procedure to compute the $FAIL_z$ and $FAIL_w$ values.

The final result is the minimal percentage of failed tests $FAIL = \min(FAIL_x, FAIL_y, FAIL_z, FAIL_w)$. The acceptable result is FAIL < 50%. The test determines whether it is possible to select 24 random bits from the fixed element x, y, z, or w for each element of the integer output of the generator.

8.3.2.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

8.3.3 Bitstream Test

8.3.3.1 Test Purpose

The test uses simulation to check whether it is possible to interpret the integer output of the basic generator as a sequence of random bits.

Note: The bit precision of a basic generator determines how the sequence of random bits is formed. For example, only the 59 lower bits take part in forming the bit stream for the MCG59 generator, and only the 31 lower bits for the MCG31 generator.

8.3.3.2 First Level Test

The first level test initially forms the sequence of bits $b_0, b_1, b_2, \ldots$ from the integer output of the basic generator and then forms 20-bit overlapping words $w_0 = b_0 b_1 \ldots b_{19}$, $w_1 = b_1 b_2 \ldots b_{20}$, ... from the sequence. From the total number of $2^{21}$ formed words, the test computes the quantity K of missing 20-bit words. For a truly random sequence, the distribution of the K statistic should be very close to normal with mean a = 141,909 and standard deviation $\sigma = 428$. The test denotes the cumulative distribution function of the normal distribution with these parameters as F(x). As a result, the p-value p = F(K) should be uniformly distributed within the interval (0, 1).

BRNG         Integer Output Interpretation
MCG31m1      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-30. NB=31, WS=32.
R250         Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MRG32k3a     Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MCG59        Array of 64-bit integers. Each 64-bit integer uses the following bits: 0-58. NB=59, WS=64.
WH           Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-23. NB=24, WS=32.
MT19937      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MT2203       Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
SFMT19937    Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form the bit sequence. For the WH generator, the test selects only the NB lower bits from each of the four WS-bit elements.

8.3.3.3 Second Level Test

The second level test performs the first level test 20 times. The result of each first level test is the p-value $p_j$, j = 1, 2, ..., 20. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained set of $p_j$ (j = 1, 2, ..., 20). If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.3.4 Final Result Interpretation

The second level test is performed ten times. The final result of the test is the percentage FAIL of failed second level tests. The acceptable result is FAIL < 50%.

8.3.3.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The lower bits are not random for multiplicative congruential generators whose modulus is a power of two (for example, MCG59); thus, the Bitstream Test fails for such generators.
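
The remark about power-of-two moduli can be checked directly. The Python sketch below iterates a multiplicative congruential recurrence $x_{n+1} = a x_n \bmod 2^{59}$. The multiplier $13^{13}$ is assumed here as the MCG59 multiplier, and the helper name mcg_states is hypothetical. Because $a \equiv 1 \pmod 4$, the two lowest bits of the state never change, and because $a^2 \equiv 1 \pmod 8$ for any odd a, the three lowest bits take at most two distinct values.

```python
A = 13**13  # multiplier assumed for MCG59 (an assumption of this sketch)
M = 2**59   # power-of-two modulus


def mcg_states(seed, count, a=A, m=M):
    """Yield successive states x <- a * x mod m of a multiplicative generator."""
    x = seed % m
    for _ in range(count):
        x = (a * x) % m
        yield x


states = list(mcg_states(seed=12345, count=1000))

low2 = {x & 0b11 for x in states}   # values taken by the two lowest bits
low3 = {x & 0b111 for x in states}  # values taken by the three lowest bits
high3 = {x >> 56 for x in states}   # values taken by the three highest bits

print(len(low2), len(low3), len(high3))
```

A stream of bits that includes such low-order bits contains far too much regularity, which is the kind of defect the missing-word statistic K is sensitive to.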
8.3.4 Rank of 31x31 Binary Matrices Test

8.3.4.1 Test Purpose

The test evaluates the randomness of 31-bit groups of 31 sequential random numbers of the integer output. The stable response is the rank of the binary matrix composed of the random numbers. For the generators with more than 31-bit precision, the test performs iterations for all possible 31-bit groups of bits (0-30, 1-31, ...).

8.3.4.2 First Level Test

The first level test selects, with s fixed, groups of bits $b_s, b_{s+1}, \ldots, b_{s+30}$ from each element of the integer output and forms a 31x31 binary matrix from these 31 groups. The first level test composes 40,000 such matrices out of sequential elements of the integer output of the generator. Then the test computes the number of matrices with the rank of 31, the number of matrices with the rank of 30, the number of matrices with the rank of 29, and the number of matrices with the rank less than 29. For a truly random sequence, the probability of composing a matrix of rank 31 is 0.289, of rank 30 is 0.578, of rank 29 is 0.128, and of rank less than 29 is 0.005. Therefore, the test divides all possible matrix ranks into four groups. The test makes a V statistic with a chi-square distribution with three degrees of freedom for these four groups. Then the first level test applies the chi-square goodness-of-fit test to the groups. The testing result is the p-value.

Note: The acceptable values of s are specific for each basic generator. The test is not applicable for the basic generator WH.

BRNG         Integer Output Interpretation
MCG31m1      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-30. NB=31, WS=32.
R250         Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MRG32k3a     Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MCG59        Array of 64-bit integers. Each 64-bit integer uses the following bits: 0-58. NB=59, WS=64.
WH           Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-23. NB=24, WS=32.
MT19937      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MT2203       Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
SFMT19937    Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form the bit sequence.

8.3.4.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The result is the set of p-values $p_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained set of $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.4.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 31$. The test computes the percentage FAIL of failed second level tests for each s. The final result is the minimal percentage of failed tests $FAIL = \min(FAIL_0, FAIL_1, \ldots, FAIL_{NB-31})$ over $0 \le s \le NB - 31$. The acceptable result is FAIL < 50%. Therefore the test indicates whether it is possible to single out at least 31 random bits from each element of the generator integer output such that 31 random numbers of 31 bits each behave randomly enough under this particular test.

8.3.4.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The Rank of 31x31 Binary Matrices Test cannot be applied to the WH generator, as each element of this generator has only 24 bits.

8.3.5 Rank of 32x32 Binary Matrices Test

8.3.5.1 Test Purpose

The test evaluates the randomness of 32-bit groups of 32 sequential random numbers of the integer output.
The stable response is the rank of the binary matrix composed of the random numbers. For the generators with more than 32-bit precision, the test performs iterations for all possible 32-bit groups of bits (0-31, 1-32, ...).

8.3.5.2 First Level Test

The first level test selects, with s fixed, groups of bits $b_s, b_{s+1}, \ldots, b_{s+31}$ from each element of the integer output. Then it forms a 32x32 binary matrix from these 32 groups. The first level test composes 40,000 such matrices out of sequential elements of the integer output of the generator. Then the test computes the number of matrices with the rank of 32, the number of matrices with the rank of 31, the number of matrices with the rank of 30, and the number of matrices with the rank less than 30. For a truly random sequence, the probability of composing a matrix of rank 32 is 0.289, of rank 31 is 0.578, of rank 30 is 0.128, and of rank less than 30 is 0.005. Therefore, the test divides all possible matrix ranks into four groups. The test makes a V statistic with a chi-square distribution with three degrees of freedom for these four groups. Then the first level test applies the chi-square goodness-of-fit test to the groups. The testing result is the p-value.

Note: The acceptable values of s are specific for each basic generator. The test is not applicable for the basic generators MCG31 and WH.

BRNG         Integer Output Interpretation
MCG31m1      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-30. NB=31, WS=32.
R250         Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MRG32k3a     Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MCG59        Array of 64-bit integers. Each 64-bit integer uses the following bits: 0-58. NB=59, WS=64.
WH           Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-23. NB=24, WS=32.
MT19937      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MT2203       Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
SFMT19937    Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form the bit sequence.

8.3.5.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The result is the set of p-values $p_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained set of $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.5.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 32$. The test computes the percentage FAIL of failed second level tests for each s. The final result is the minimal percentage of failed tests $FAIL = \min(FAIL_0, FAIL_1, \ldots, FAIL_{NB-32})$ over $0 \le s \le NB - 32$. The acceptable result is FAIL < 50%. Therefore the test indicates whether it is possible to single out at least 32 random bits from each element of the generator integer output such that 32 random numbers of 32 bits each behave randomly enough under this particular test.

8.3.5.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The Rank of 32x32 Binary Matrices Test cannot be applied to the WH generator, as each element of this generator has only 24 bits, or to the MCG31 generator, as each element of this generator has only 31 bits.

8.3.6 Rank of 6x8 Binary Matrices Test

8.3.6.1 Test Purpose

The test evaluates the randomness of the 8-bit groups of 6 sequential random numbers of the integer output. The stable response is the rank of the binary matrix composed of the random numbers.
The test checks all possible 8-bit groups: 0-7, 1-8, and so on.

8.3.6.2 First Level Test

The first level test selects, with s fixed, groups of bits $b_s, b_{s+1}, \ldots, b_{s+7}$ from each element of the integer output and forms a 6x8 binary matrix from these 6 groups. The first level test composes 100,000 such matrices out of sequential elements of the integer output of the generator. Then the test computes the number of matrices with the rank of 6, the number of matrices with the rank of 5, and the number of matrices with the rank less than 5. For a truly random sequence, the probability of composing a matrix of rank 6 is 0.773, of rank 5 is 0.217, and of rank less than 5 is 0.010. Therefore, the test divides all possible matrix ranks into three groups. The test makes a V statistic with a chi-square distribution with two degrees of freedom for these three groups. Then the first level test applies the chi-square goodness-of-fit test to the groups. The testing result is the p-value.

Note: The acceptable values of s are specific for each basic generator. The test checks each of the four elements of the integer output for the WH and SFMT19937 basic generators.

BRNG         Integer Output Interpretation
MCG31m1      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-30. NB=31, WS=32.
R250         Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MRG32k3a     Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MCG59        Array of 64-bit integers. Each 64-bit integer uses the following bits: 0-58. NB=59, WS=64.
WH           Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-23. NB=24, WS=32.
MT19937      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MT2203       Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
SFMT19937    Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form the bit sequence.

8.3.6.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The result is the set of p-values $p_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained set of $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.6.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 8$. The test computes the percentage FAIL of failed second level tests for each s. The final result is the minimal percentage of failed tests $FAIL = \min(FAIL_0, FAIL_1, \ldots, FAIL_{NB-8})$ over $0 \le s \le NB - 8$. The acceptable result is FAIL < 50%. Therefore the test indicates whether it is possible to single out at least 8 random bits from each element of the generator integer output such that six random numbers of eight bits each behave randomly enough under this particular test.

8.3.6.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

The Rank of 6x8 Binary Matrices Test checks each element of the WH generator separately, as its elements are produced by different multiplicative generators.

8.3.7 Count-the-1's Test (Stream of Bits)

8.3.7.1 Test Purpose

The test evaluates the randomness of a sequence of overlapping random five-letter words. The letters of these words have a specified probability distribution. The test forms the random letters from the integer output of the basic generator. The test regards the integer output as a sequence of bits.

8.3.7.2 First Level Test

The first level test assumes that the integer output is a sequence of random bits.
The test interprets this bit sequence as a sequence of bytes, that is, a sequence of 8-bit integer numbers. The number of 1's in every random byte should have a binomial distribution with parameters m = 8, p = 1/2. Therefore, the probability of getting k 1's in a byte is equal to $\binom{8}{k} 2^{-8}$. The first level test regards a random variable c that takes five possible values:

c = 0, if the number of 1's in a random byte is less than three,
c = 1, if the number of 1's in a random byte is three,
c = 2, if the number of 1's in a random byte is four,
c = 3, if the number of 1's in a random byte is five,
c = 4, if the number of 1's in a random byte is more than five.

The probability distribution of c is the following: $P(c=0) = 37/256$, $P(c=1) = 56/256$, $P(c=2) = 70/256$, $P(c=3) = 56/256$, $P(c=4) = 37/256$. The test interprets c as a selection of a random letter from the alphabet {a, b, c, d, e} with these probabilities respectively. Thus, the sequence of random bytes $b_0, b_1, b_2, \ldots$ corresponds to a sequence of random letters $l_0, l_1, l_2, \ldots$. From this sequence the test forms overlapping words of length four, $v_1 = l_1 l_2 l_3 l_4$, $v_2 = l_2 l_3 l_4 l_5$, ..., and of length five, $w_1 = l_1 l_2 l_3 l_4 l_5$, $w_2 = l_2 l_3 l_4 l_5 l_6$, .... The test computes the frequencies of each of the 625 possible four-letter words and of the 3,125 possible five-letter words among 2,560,000 obtained words. From these frequencies, the test makes the chi-square statistics $V_1$ and $V_2$ for the four- and five-letter words respectively. The test takes into account the covariance of the frequencies of four-letter and five-letter words and performs the chi-square test for the $V_2 - V_1$ statistic. The $V_2 - V_1$ statistic is asymptotically normal with mean a = 2500 and standard deviation $\sigma = 70.71$. The result of the first level test is the p-value.

BRNG         Integer Output Interpretation
MCG31m1      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-30. NB=31, WS=32.
R250         Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MRG32k3a     Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MCG59        Array of 64-bit integers. Each 64-bit integer uses the following bits: 0-58. NB=59, WS=64.
WH           Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-23. NB=24, WS=32.
MT19937      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MT2203       Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
SFMT19937    Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.

The test selects only the NB lower bits from each WS-bit integer to form the bit sequence.

8.3.7.3 Second Level Test

The second level test performs the first level test ten times. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained p-values $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.7.4 Final Result Interpretation

The second level test is performed ten times. The test computes the percentage FAIL of failed second level tests. The acceptable result is FAIL < 50%.

8.3.7.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

For the WH and SFMT19937 generators, all four elements are used to form the bit sequence.

8.3.8 Count-the-1's Test (Stream of Specific Bytes)

8.3.8.1 Test Purpose

The test evaluates the randomness of a sequence of overlapping random five-letter words. The letters of these words have a specified probability distribution. The test forms the random letters from the integer output of the basic generator.
The test selects only 8 sequential bits from each element, starting with a certain fixed bit s.

8.3.8.2 First Level Test

The test selects the bits $d_s, d_{s+1}, \ldots, d_{s+7}$, which determine the next random byte, from each element of the integer output, where $0 \le s \le NB - 8$ (see the table below). The number of 1's in every random byte should have a binomial distribution with parameters m = 8, p = 1/2. Therefore, the probability of getting k 1's in a byte is equal to $\binom{8}{k} 2^{-8}$. The first level test regards a random variable c that takes five possible values:

c = 0, if the number of 1's in a random byte is less than three,
c = 1, if the number of 1's in a random byte is three,
c = 2, if the number of 1's in a random byte is four,
c = 3, if the number of 1's in a random byte is five,
c = 4, if the number of 1's in a random byte is more than five.

The probability distribution of c is the following: $P(c=0) = 37/256$, $P(c=1) = 56/256$, $P(c=2) = 70/256$, $P(c=3) = 56/256$, $P(c=4) = 37/256$. The test interprets c as a selection of a random letter from the alphabet {a, b, c, d, e} with these probabilities respectively. Thus, the sequence of random bytes $b_0, b_1, b_2, \ldots$ corresponds to a sequence of random letters $l_0, l_1, l_2, \ldots$. From this sequence the test forms overlapping words of length four, $v_1 = l_1 l_2 l_3 l_4$, $v_2 = l_2 l_3 l_4 l_5$, ..., and of length five, $w_1 = l_1 l_2 l_3 l_4 l_5$, $w_2 = l_2 l_3 l_4 l_5 l_6$, .... The test computes the frequencies of each of the 625 possible four-letter words and of the 3,125 possible five-letter words among 256,000 obtained words. From these frequencies, the test makes the chi-square statistics $V_1$ and $V_2$ for the four- and five-letter words respectively. The test takes into account the covariance of the frequencies of four-letter and five-letter words and performs the chi-square test for the $V_2 - V_1$ statistic. The $V_2 - V_1$ statistic is asymptotically normal with mean a = 2500 and standard deviation $\sigma = 70.71$. The result of the first level test is the p-value.

BRNG         Integer Output Interpretation
MCG31m1      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-30. NB=31, WS=32.
R250         Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MRG32k3a     Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MCG59        Array of 64-bit integers. Each 64-bit integer uses the following bits: 0-58. NB=59, WS=64.
WH           Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-23. NB=24, WS=32.
MT19937      Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
MT2203       Array of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.
SFMT19937    Array of quadruples of 32-bit integers. Each 32-bit integer uses the following bits: 0-31. NB=32, WS=32.

8.3.8.3 Second Level Test

The second level test performs the first level test ten times for the fixed s. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained p-values $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails for the given s.

8.3.8.4 Final Result Interpretation

The second level test is performed ten times for each $0 \le s \le NB - 8$. The test computes the percentage FAIL of failed second level tests for each s. The final result is the minimal percentage of failed tests $FAIL = \min(FAIL_0, FAIL_1, \ldots, FAIL_{NB-8})$. The acceptable result is FAIL < 50%. Therefore, the test determines whether it is possible to select at least 8 random bits from each element of the integer output of the generator.

8.3.8.5 Tested Generators

Function Name       Application
vsRngUniform        not applicable
vdRngUniform        not applicable
viRngUniform        not applicable
viRngUniformBits    applicable

For the WH and SFMT19937 generators, the test checks each of the four elements separately.
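
The letter probabilities used by both Count-the-1's tests follow directly from the binomial distribution of the number of 1's in a byte. The Python sketch below derives them, maps bytes to letters, and forms overlapping words. The helper names are hypothetical. The closing lines check that the quoted parameters of the $V_2 - V_1$ statistic agree with those of a chi-square variable with $5^5 - 5^4$ degrees of freedom, which is an interpretation assumed here rather than a statement from the text.

```python
from math import comb, sqrt

# P(k ones in a byte) for a binomial distribution with m = 8, p = 1/2.
ones_prob = [comb(8, k) / 2**8 for k in range(9)]

# Letter classes: fewer than 3 ones, exactly 3, 4, 5, and more than 5 ones.
letter_prob = [
    sum(ones_prob[0:3]),
    ones_prob[3],
    ones_prob[4],
    ones_prob[5],
    sum(ones_prob[6:9]),
]


def byte_to_letter(byte):
    """Map a byte (0..255) to a letter 'a'..'e' by its number of 1 bits."""
    ones = bin(byte & 0xFF).count("1")
    return "abcde"[min(max(ones - 2, 0), 4)]


def overlapping_words(letters, length):
    """All overlapping words of the given length from a letter sequence."""
    return ["".join(letters[i:i + length]) for i in range(len(letters) - length + 1)]


letters = [byte_to_letter(b) for b in (0x00, 0x07, 0x0F, 0x1F, 0xFF, 0x55)]
words5 = overlapping_words(letters, 5)

# Assumed interpretation: mean = degrees of freedom, variance = 2 * mean.
mean = 5**5 - 5**4
std = sqrt(2 * mean)
print(letter_prob, words5, mean, round(std, 2))
```

In a full test the same byte_to_letter mapping would be applied either to consecutive bytes of the bit stream or to the byte starting at bit s of each element, depending on the test variant.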
8.3.9 Craps Test

8.3.9.1 Test Purpose

The test evaluates the randomness of an output sequence of uniformly distributed random numbers that imitates the process of dice tossing in the game of craps. The stable response is the number of tosses of the pair of dice necessary to complete the game and the frequency of wins in the game.

8.3.9.2 First Level Test

From the output sequence of random numbers, the test forms a sequence of random numbers that take the values from 1 to 6 with equal probability. The test treats every number as the number of spots on the face of a die. Thus the test regards a pair of numbers as the result of a toss of two dice. If on the first throw the sum of the spots on the faces of the dice equals 7 or 11, it is a win; if the sum equals 2, 3, or 12, it is a loss. In other cases, additional throws are necessary to determine the result of the game. The test performs additional throws until the sum of the spots equals 7 or coincides with the sum thrown on the first throw. If the sum equals 7, it is a loss; otherwise, it is a win.

The theoretical probability of a win is 244/495, that is, a little less than 0.5. Further, the frequency of wins in K repetitions of the game, with K = 200,000, has a distribution very close to normal with mean $a = K \cdot 244/495$ and standard deviation $\sigma = \sqrt{a \cdot 251/495}$. The number of throws necessary to complete the game can take the values 1, 2, .... Over K iterations of the game, the test computes the frequencies of getting c = 1, c = 2, ..., c = 20, c > 20. Based on these frequencies, the test makes the chi-square statistic V with the chi-square distribution with 20 degrees of freedom. The result of the first level test is the pair of p-values p and q for the number of tosses and the frequency of wins respectively.

8.3.9.3 Second Level Test

The test performs the first level test ten times.
The result of each iteration of the first level test is the pair of p-values $p_j$ and $q_j$, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained p-values $p_j$, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails. Similarly, the test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the obtained p-values $q_j$, j = 1, 2, ..., 10. If the resulting p-value is q < 0.05 or q > 0.95, the test fails. The test passes in all other cases.

8.3.9.4 Final Result Interpretation

The test performs the second level test ten times. The final result of the test is the percentage FAIL of failed second level tests. The acceptable result is FAIL < 50%.

8.3.9.5 Tested Generators

Function Name       Application
vsRngUniform        applicable
vdRngUniform        applicable
viRngUniform        applicable
viRngUniformBits    applicable

8.3.10 Parking Lot Test

8.3.10.1 Test Purpose

The test evaluates the randomness of two-dimensional random points uniformly distributed in a square with a side of length 100. The stable response is the number of successfully "parked" points out of 12,000 random two-dimensional points.

8.3.10.2 First Level Test

The test considers the next random point (x, y) successfully "parked" if it is far enough from every previously "parked" point. The sufficient distance between the points $(x_1, y_1)$ and $(x_2, y_2)$ is . Numerous experiments show that out of 12,000 truly random points only 3,523 points park successfully on average. Moreover, the number K of points successfully parked after 12,000 attempts has a distribution close to normal with mean a = 3,523 and standard deviation $\sigma = 21.9$. Consequently, $(K - a)/\sigma$ should have a distribution close to the standard normal one with the cumulative distribution function $\Phi(x)$. The result of the test is the p-value $p = \Phi((K - a)/\sigma)$.
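
The conversion from the number of parked points K to a p-value can be sketched as follows. This Python fragment assumes that the p-value is the standard normal cumulative distribution function evaluated at $(K - a)/\sigma$, in line with the p = F(K) convention of the Bitstream Test. The function names are hypothetical, and the parking simulation itself is not reproduced.

```python
from math import erf, sqrt

MEAN_PARKED = 3523.0  # mean a quoted in the text
STD_PARKED = 21.9     # standard deviation sigma quoted in the text


def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def parking_p_value(k):
    """p-value for k parked points, assuming p = Phi((k - a) / sigma)."""
    return std_normal_cdf((k - MEAN_PARKED) / STD_PARKED)


def is_suspicious(p, low=0.05, high=0.95):
    """Threshold rule applied to p-values in the second level tests."""
    return p < low or p > high


p_typical = parking_p_value(3523)  # K exactly at the mean
p_small = parking_p_value(3450)    # K more than three sigma below the mean
print(p_typical, is_suspicious(p_typical), is_suspicious(p_small))
```

A generator that parks noticeably fewer or more points than expected produces p-values near 0 or 1, which the second level uniformity test then detects.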
8.3.10.3 Second Level Test
The test performs the first level test ten times. The result of each iteration of the first level test is the p-value pj, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the p-values pj, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.10.4 Final Result Interpretation
The final result of the test is the percentage FAIL of the failed second level tests. The test performs the second level test ten times. The acceptable result is the value of FAIL < 50%.

8.3.10.5 Tested Generators
Function Name | Application
vsRngUniform | applicable
vdRngUniform | applicable
viRngUniform | not applicable
viRngUniformBits | applicable

8.3.11 2D Self-Avoiding Random Walk Test

8.3.11.1 Test Purpose
The test evaluates the randomness of the output vector of the generator. The stable response is the frequency with which a point walking randomly along the sites of a lattice reaches the upper side of the lattice.

8.3.11.2 First Level Test
A random particle walks along the sites of a square lattice. With each new step, the particle moves one step forward diagonally in one of the possible directions. The lattice has two types of sides: the lower and left-hand sides are totally reflecting, while the upper and right-hand sides are totally absorbing. When the particle reaches the lower or left-hand side, its direction vector makes a 90-degree bend. When the particle reaches the upper or right-hand side, that side absorbs it and the walk completes. The particle starts from the lower left-hand site of the lattice in the northeast direction. If the particle encounters an unvisited site, it turns by 90 degrees clockwise or counter-clockwise, each with probability 1/2, and continues the walk.
If the particle encounters an already visited site of the lattice, it chooses the movement direction so that it does not re-trace any part of the path already passed. Due to the symmetry of the task, the upper side and the right-hand side should absorb the particle with equal probability. The test determines the frequency of reaching the upper side of the lattice from 500 iterations of the walking process. If M is the number of attempts in which the particle reaches the upper side, then the standardized count (M - 250)/sqrt(125) has a close to standard normal distribution, since M has mean 500/2 = 250 and variance 500/4 = 125. The result of the first level test is the p-value .

8.3.11.3 Second Level Test
The test performs the first level test ten times. The result of each iteration of the first level test is the p-value pj, j = 1, 2, ..., 10. The test applies the Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling statistics to the p-values pj, j = 1, 2, ..., 10. If the resulting p-value is p < 0.05 or p > 0.95, the test fails.

8.3.11.4 Final Result Interpretation
The final result of the test is the percentage FAIL of the failed second level tests. The test performs the second level test ten times. The acceptable result is the value of FAIL < 50%.

8.3.11.5 Tested Generators
Function Name | Application
vsRngUniform | applicable
vdRngUniform | applicable
viRngUniform | not applicable
viRngUniformBits | applicable

8.3.12 Template Test

8.3.12.1 Test Purpose
The test evaluates the conformity of the generator output with the template sequence of random numbers. The test forms the specified output integer sequence from the recurrence and the specified initial conditions. The parameters of the recurrences are selected such that the output sequences possess "good" properties (good multidimensional uniformity, large period, and so on). If any member of the sequence is computed incorrectly, the other members of the sequence are computed incorrectly as well.
Moreover, if a member differs from the correct (template) sequence in a single bit, the subsequent members of the sequence may differ significantly from the template sequence. The quality of such a sequence is then very likely to be much worse than the quality of the template sequence. That is why all the basic generators of the VSL undergo thorough tests for conformity with template sequences. The test also checks the real-valued uniform output of the basic generators for conformity with the template output. Obviously, the output sequences are different for real arithmetic of single and double precision. Unlike the integer output, where every member must coincide bit for bit with the template member, bitwise coincidence is not required for the real output. The lower bits of the mantissa of the real output do not influence randomness; it is the upper bits that determine the quality of the output sequence. For example, the coincidence of the upper binary digits of the mantissa is sufficient for most applications (see the chapter Spectral Test in [Knuth81]). This test is also used to validate the VSL basic quasi-random number generators.

8.3.12.2 Final Result Interpretation
The final result is the number of sequence members that do not coincide with the template members. The value should be equal to 0. For real sequences the test assumes that a sequence member coincides with the template member if at least 8 upper binary digits of the mantissa coincide.

8.3.12.3 Tested Generators
Function Name | Application
vsRngUniform | applicable
vdRngUniform | applicable
viRngUniform | not applicable
viRngUniformBits | applicable

8.4 BRNG Properties and Testing Results
This section contains the empirical testing results for the VSL basic generators, obtained with the tests described in the BRNG Tests Description section, together with other information on the properties of the basic generators and the rules of the output vector interpretation.
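The pass/fail rule shared by the tests in the BRNG Tests Description section (a run fails when its p-value lies outside [0.05, 0.95], and the result is acceptable when fewer than 50% of the runs fail) can be sketched in C as follows. The function names are hypothetical and are not part of VSL; the sketch only restates the threshold logic and does not compute the p-values themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a run fails, that is, if its p-value lies outside
   the interval [0.05, 0.95]; returns 0 otherwise. */
int run_fails(double p_value)
{
    return p_value < 0.05 || p_value > 0.95;
}

/* Percentage FAIL of failed runs among n p-values (n > 0). */
double fail_percentage(const double *p_values, size_t n)
{
    size_t failed = 0;
    assert(p_values != NULL && n > 0);
    for (size_t i = 0; i < n; ++i)
        failed += (size_t)run_fails(p_values[i]);
    return 100.0 * (double)failed / (double)n;
}

/* Final verdict: returns 1 (acceptable) when FAIL < 50%. */
int verdict_ok(const double *p_values, size_t n)
{
    return fail_percentage(p_values, n) < 50.0;
}
```

With ten runs, four failures give FAIL = 40% and an acceptable result, while five failures give FAIL = 50% and an unacceptable one.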
8.4.1 MCG31m1
This is a 31-bit multiplicative congruential generator.
MCG31m1 is a linear congruential generator with a period length of approximately 2^31. Such generators are still used as default random number generators in various software systems, mainly because portable versions are simple to implement, fast, and compatible with earlier versions of those systems. However, their period length does not meet the requirements for modern basic generators. Still, the MCG31m1 generator possesses good statistical properties, and you may successfully use it to generate random numbers of different distributions for small samples.

8.4.1.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of the floating-point values

8.4.1.2 Integer Implementation
The output vector of 32-bit integers

8.4.1.3 Stream Initialization by Function vslNewStream
MCG31m1 generates the stream and initializes it using the input 32-bit parameter seed:
• Assume x0 = seed mod 0x7FFFFFFF.
• If x0 = 0, assume x0 = 1.

8.4.1.4 Stream Initialization by Function vslNewStreamEx
MCG31m1 generates the stream and initializes it using the array params[] of n 32-bit integers:
• If n = 0, assume x0 = 1.
• Otherwise, assume x0 = params[0] mod 0x7FFFFFFF. If x0 = 0, assume x0 = 1.

8.4.1.5 Subsequences Selection Methods
vslSkipAheadStream | supported
vslLeapfrogStream | supported

8.4.1.6 Generator Period

8.4.1.7 Lattice Structure
M8 = 0.72771, M16 = 0.61996, M32 = 0.61996 (for more details see [L'Ecu94]).
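The seed handling described in sections 8.4.1.3 and 8.4.1.4 can be sketched in C as follows. This is a minimal illustration of the stated rules only, not the library's code: the function names are hypothetical, and the sketch computes just the initial value x0 without generating any output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCG31M1_MODULUS 0x7FFFFFFFu  /* 2^31 - 1, the modulus used in the text */

/* vslNewStream rule: x0 = seed mod 0x7FFFFFFF, and a zero result maps to 1. */
uint32_t mcg31m1_x0_from_seed(uint32_t seed)
{
    uint32_t x0 = seed % MCG31M1_MODULUS;
    return x0 == 0 ? 1u : x0;
}

/* vslNewStreamEx rule: n == 0 gives x0 = 1; otherwise params[0]
   is treated exactly like the seed above. */
uint32_t mcg31m1_x0_from_params(size_t n, const uint32_t *params)
{
    if (n == 0)
        return 1u;
    assert(params != NULL);
    return mcg31m1_x0_from_seed(params[0]);
}
```

Note that the seeds 0, 0x7FFFFFFF and 0xFFFFFFFF all lead to the same initial value x0 = 1, so they produce identical streams under these rules.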
8.4.1.8 Empirical Testing Results Summary
Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (10% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | N/A
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of bits) | N/A | N/A | N/A | OK (20% errors)
Counts-the-1's Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (20% errors)
Parking Lot Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
2D Self-Avoiding Random Walk Test | OK (20% errors) | OK (20% errors) | N/A | OK (20% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.2 R250
This is a generalized feedback shift register generator.
Feedback shift register generators have an ample theoretical foundation and were first intended for cryptographic and communication applications. Physicists widely use the R250 generator because it is simple and fast to implement. However, this generator fails some types of tests, one of which is the 2D Self-Avoiding Random Walk Test.
8.4.2.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of the floating-point values

8.4.2.2 Integer Implementation
The output vector of 32-bit integers

8.4.2.3 Stream Initialization by Function vslNewStream
R250 generates the stream and initializes it using the input 32-bit integer parameter seed. The stream state is the array of 250 32-bit integers, initialized in the following way:
• If seed = 0, assume seed = 1. Assume x_{-250} = seed.
• Initialize according to the recurrence relation .
• Interpret the values as a binary matrix of size 32x32 and perform the following: set the diagonal bits to 1, and the under-diagonal bits to 0.

8.4.2.4 Stream Initialization by Function vslNewStreamEx
R250 generates the stream and initializes it using the array params[] of n 32-bit integers:
• If n = 0, assume seed = 1 and perform the initialization as described in the above section on stream initialization by the function vslNewStream.
• Otherwise, assume x_{k-250} = params[k], k = 0, 1, ..., 249.

8.4.2.5 Subsequences Selection Methods
vslSkipAheadStream | not supported
vslLeapfrogStream | not supported

8.4.2.6 Generator Period
8.4.2.7 Empirical Testing Results Summary
Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (25% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of bits) | N/A | N/A | N/A | OK (30% errors)
Counts-the-1's Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (20% errors)
Parking Lot Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
2D Self-Avoiding Random Walk Test | FAIL (70% errors) | FAIL (80% errors) | N/A | FAIL (80% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.3 MRG32k3a
This is a 32-bit combined multiple recursive generator with 2 components of order 3.
The MRG32k3a combined generator meets the requirements for modern RNGs, such as good multidimensional uniformity and a long period. Optimization for various Intel® architectures makes it competitive with the other VSL basic generators in terms of speed.

8.4.3.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of the floating-point values

8.4.3.2 Integer Implementation
The output vector of 32-bit integers

8.4.3.3 Stream Initialization by Function vslNewStream
MRG32k3a generates the stream and initializes it using the 32-bit input integer parameter seed.
The stream state is the two triplets of 32-bit integers (x_{-3}, x_{-2}, x_{-1}) and (y_{-3}, y_{-2}, y_{-1}), initialized in the following way:
• Assume x_{-3} = seed.
• Assume the other values equal to 1, that is, x_{-2} = x_{-1} = y_{-3} = y_{-2} = y_{-1} = 1.

8.4.3.4 Stream Initialization by Function vslNewStreamEx
MRG32k3a generates the stream and initializes it using the array params[] of n 32-bit integers. In every case below, the state values that are not listed are assumed equal to 1:
• If n = 0, assume all six state values equal to 1.
• If n = 1, assume x_{-3} = params[0] mod m1.
• If n = 2, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1.
• If n = 3, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1, x_{-1} = params[2] mod m1. If the values prove to be x_{-3} = x_{-2} = x_{-1} = 0, assume x_{-3} = 1.
• If n = 4, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1, x_{-1} = params[2] mod m1, y_{-3} = params[3] mod m2. If the values prove to be x_{-3} = x_{-2} = x_{-1} = 0, assume x_{-3} = 1.
• If n = 5, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1, x_{-1} = params[2] mod m1, y_{-3} = params[3] mod m2, y_{-2} = params[4] mod m2. If the values prove to be x_{-3} = x_{-2} = x_{-1} = 0, assume x_{-3} = 1.
• If n = 6, assume x_{-3} = params[0] mod m1, x_{-2} = params[1] mod m1, x_{-1} = params[2] mod m1, y_{-3} = params[3] mod m2, y_{-2} = params[4] mod m2, y_{-1} = params[5] mod m2. If the values prove to be x_{-3} = x_{-2} = x_{-1} = 0, assume x_{-3} = 1. If the values prove to be y_{-3} = y_{-2} = y_{-1} = 0, assume y_{-3} = 1.

8.4.3.5 Subsequences Selection Methods
vslSkipAheadStream | supported
vslLeapfrogStream | not supported

8.4.3.6 Generator Period

8.4.3.7 Lattice Structure
M8 = 0.68561, M16 = 0.63940, M32 = 0.63359.
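The case analysis in section 8.4.3.4 follows one pattern, which can be sketched in C as follows. This is an illustration under stated assumptions, not the library's code: the type and function names are hypothetical, the moduli m1 and m2 are passed in by the caller because their values are not reproduced in the text above, state values without a matching params[] element default to 1 (as in the vslNewStream rule), and elements beyond the sixth are assumed to be ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x[0..2] hold x_{-3}, x_{-2}, x_{-1}; y[0..2] hold y_{-3}, y_{-2}, y_{-1}. */
typedef struct {
    uint32_t x[3];
    uint32_t y[3];
} mrg32k3a_state;

void mrg32k3a_init_ex(mrg32k3a_state *st, size_t n, const uint32_t *params,
                      uint32_t m1, uint32_t m2)
{
    assert(st != NULL && m1 > 1 && m2 > 1);
    assert(n == 0 || params != NULL);
    if (n > 6)
        n = 6;  /* assumption: extra elements are ignored */

    for (size_t i = 0; i < 3; ++i) {
        /* Supplied values are reduced modulo m1 (x) or m2 (y);
           missing values default to 1. */
        st->x[i] = (i < n) ? params[i] % m1 : 1u;
        st->y[i] = (i + 3 < n) ? params[i + 3] % m2 : 1u;
    }

    /* An all-zero triplet is not allowed; replace its first element by 1.
       For n < 3 (x) or n < 6 (y) the check never triggers, because at
       least one element of the triplet holds the default value 1. */
    if (st->x[0] == 0 && st->x[1] == 0 && st->x[2] == 0)
        st->x[0] = 1u;
    if (st->y[0] == 0 && st->y[1] == 0 && st->y[2] == 0)
        st->y[0] = 1u;
}
```

Writing the zero checks unconditionally keeps the function short while giving the same result as the per-case rules listed above.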
8.4.3.8 Empirical Testing Results Summary
Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (20% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (20% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of bits) | N/A | N/A | N/A | OK (20% errors)
Counts-the-1's Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (20% errors)
Parking Lot Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
2D Self-Avoiding Random Walk Test | OK (20% errors) | OK (20% errors) | N/A | OK (20% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.4 MCG59
This is a 59-bit multiplicative congruential generator.
The multiplicative congruential generator MCG59 is one of the two basic generators implemented in the NAG Numerical Libraries. Because the modulus of the generator is not prime, its period length is not 2^59 but only 2^57, provided that the initial value (seed) is odd. The drawback of these generators is well known (see, for example, [Cram46], [Ent98]): the lower bits of the generated sequence of pseudo-random numbers are not random, so breaking numbers down into their bit patterns and using individual bits may cause trouble.
Besides, block-splitting the full-period sequence into 2^d equal-length blocks yields blocks that are fully identical in their d lower bits.

8.4.4.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of the floating-point values

8.4.4.2 Integer Implementation
The output vector of the 32-bit integers is
Thus, the output vector stores every 59-bit member of the integer output as two 32-bit integers. For example, to get a vector of n 59-bit integers, the output array should be large enough to store 2n 32-bit numbers.

8.4.4.3 Stream Initialization by Function vslNewStream
MCG59 generates the stream and initializes it using the 32-bit input integer parameter seed:
• Assume x0 = seed mod 2^59.
• If x0 = 0, assume x0 = 1.

8.4.4.4 Stream Initialization by Function vslNewStreamEx
MCG59 generates the stream and initializes it using the array params[] of n 32-bit integers:
• If n = 0, assume x0 = 1.
• If n = 1, assume seed = params[0] and follow the instructions in the above section on stream initialization by the function vslNewStream.
• Otherwise, assume seed = params[0] + 2^32 * params[1] and follow the instructions in the above section on stream initialization by the function vslNewStream.

8.4.4.5 Subsequences Selection Methods
vslSkipAheadStream | supported
vslLeapfrogStream | supported

8.4.4.6 Generator Period

8.4.4.7 Lattice Structure
S2 = 0.84; S3 = 0.73; S4 = 0.74; S5 = 0.58; S6 = 0.63; S7 = 0.52; S8 = 0.55; S9 = 0.56.
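The seed rules of sections 8.4.4.3 and 8.4.4.4 can be sketched in C as follows. The function names are hypothetical and the sketch only derives the initial value x0; it is an illustration of the stated rules rather than the library's implementation. The reduction modulo 2^59 is written as a bit mask, which is equivalent for unsigned integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x mod 2^59 equals x & MCG59_MASK for unsigned 64-bit x. */
#define MCG59_MASK ((UINT64_C(1) << 59) - 1)

/* vslNewStream rule: x0 = seed mod 2^59, and a zero result maps to 1. */
uint64_t mcg59_x0_from_seed(uint64_t seed)
{
    uint64_t x0 = seed & MCG59_MASK;
    return x0 == 0 ? UINT64_C(1) : x0;
}

/* vslNewStreamEx rule: n == 0 gives x0 = 1; n == 1 uses params[0];
   larger n combines the first two elements into one 64-bit seed. */
uint64_t mcg59_x0_from_params(size_t n, const uint32_t *params)
{
    uint64_t seed;
    if (n == 0)
        return UINT64_C(1);
    assert(params != NULL);
    seed = params[0];
    if (n > 1)
        seed += (uint64_t)params[1] << 32;  /* params[0] + 2^32 * params[1] */
    return mcg59_x0_from_seed(seed);
}
```

Since the text above states the 2^57 period only for odd initial values, a caller who controls the seed may prefer to pass an odd value in params[0].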
8.4.4.8 Empirical Testing Results Summary
Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (45% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of bits) | N/A | N/A | N/A | FAIL (100% errors)
Counts-the-1's Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (10% errors) | OK (10% errors) | OK (10% errors) | OK (10% errors)
Parking Lot Test | OK (20% errors) | OK (20% errors) | N/A | OK (20% errors)
2D Self-Avoiding Random Walk Test | OK (20% errors) | OK (10% errors) | N/A | OK (10% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.
[1] The generator fails the test for bit groups 0-23, 1-24, 2-25, 3-26, 5-28.
[2] The generator fails the test for bit groups 0-30, 1-31.
[3] The generator fails the test for bit groups 0-31, 1-32.
[4] The generator fails the test for bit groups 0-7, ..., 9-16, 11-18, 32-39, ..., 37-44, 39-46, ..., 41-48.
[5] The generator fails the test for bit groups 0-7, ..., 11-18, 13-20, ..., 15-22.

8.4.5 WH
This is a set of 273 Wichmann-Hill combined multiplicative congruential generators (j = 1, 2, ..., 273).
WH is a set of 273 different basic generators. This generator is the second basic generator in the NAG libraries.
The constants ai,j range from 112 to 127; the constants mi,j are prime numbers ranging from 16,718,909 to 16,776,971, close to 2^24. These constants should show good results in the spectral test (see Knuth [Knuth81] and MacLaren [MacLaren89]). The period of each Wichmann-Hill generator would be equal to 2^92 were it not for common factors between (m1,j - 1), (m2,j - 1), (m3,j - 1) and (m4,j - 1). However, each generator should still have a period of at least 2^80. The generated pseudo-random sequences are essentially independent of one another according to the spectral test (for detailed information about properties of these generators see [MacLaren89]).

8.4.5.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of the floating-point values

8.4.5.2 Integer Implementation
The output vector of 32-bit integers
Thus, the output vector stores every quadruple (x, y, z, w) of members of the integer output as four 32-bit integers. For example, to get a vector of n quadruples (x, y, z, w), the output array should be large enough to store 4n 32-bit numbers.

8.4.5.3 Stream Initialization by Function vslNewStream
WH generates the stream and initializes it using the 32-bit input integer parameter seed:
• Assume x0 = seed mod m1. If x0 = 0, assume x0 = 1.
• Assume y0 = 1, z0 = 1, w0 = 1.
The WH generator is a set of 273 basic generators. To select a WH generator, add an offset to the named constant VSL_BRNG_WH: VSL_BRNG_WH+0, VSL_BRNG_WH+1, ..., VSL_BRNG_WH+272. The following example illustrates the initialization of the seventh (of 273) WH generator:
    vslNewStream (&stream, VSL_BRNG_WH+6, seed);

8.4.5.4 Stream Initialization by Function vslNewStreamEx
WH generates the stream and initializes it using the array params[] of n 32-bit integers:
• If n = 0, assume x0 = 1, y0 = 1, z0 = 1, w0 = 1.
• If n = 1, assume x0 = params[0] mod m1, y0 = 1, z0 = 1, w0 = 1. If x0 = 0, assume x0 = 1.
• If n = 2, assume x0 = params[0] mod m1, y0 = params[1] mod m2, z0 = 1, w0 = 1. If x0 = 0, assume x0 = 1. If y0 = 0, assume y0 = 1.
• If n = 3, assume x0 = params[0] mod m1, y0 = params[1] mod m2, z0 = params[2] mod m3, w0 = 1. If x0 = 0, assume x0 = 1. If y0 = 0, assume y0 = 1. If z0 = 0, assume z0 = 1.
• If n = 4, assume x0 = params[0] mod m1, y0 = params[1] mod m2, z0 = params[2] mod m3, w0 = params[3] mod m4. If x0 = 0, assume x0 = 1. If y0 = 0, assume y0 = 1. If z0 = 0, assume z0 = 1. If w0 = 0, assume w0 = 1.

8.4.5.5 Subsequences Selection Methods
vslSkipAheadStream | supported
vslLeapfrogStream | supported

8.4.5.6 Generator Period

8.4.5.7 Empirical Testing Results Summary
Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
Birthday Spacing Test | N/A | N/A | N/A | FAIL (60% errors)
Bitstream Test | N/A | N/A | N/A | OK (10% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | N/A
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | N/A
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of bits) | N/A | N/A | N/A | OK (10% errors)
Counts-the-1's Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (10% errors)
Parking Lot Test | OK (10% errors) | OK (10% errors) | N/A | OK (10% errors)
2D Self-Avoiding Random Walk Test | OK (10% errors) | OK (0% errors) | N/A | OK (20% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.6 MT19937
This is a Mersenne Twister pseudorandom number generator.
Matrix A (32x32) has the following format: , where the 32-bit vector has the value .
The Mersenne Twister pseudorandom number generator MT19937 is a modification of the twisted generalized feedback shift register generator [Matsum92], [Matsum94]. MT19937 has a period length of 2^19937 - 1 and is 623-dimensionally equidistributed up to 32-bit accuracy. These properties make the generator applicable for simulations in various fields of science and engineering. The initialization procedure is essentially the same as described in [MT2002]. The state of the generator is represented by 624 32-bit unsigned integers.

8.4.6.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of the floating-point values

8.4.6.2 Integer Implementation
The output vector of 32-bit integers

8.4.6.3 Stream Initialization by Function vslNewStream
MT19937 generates the stream and initializes it using the input 32-bit unsigned integer parameter seed. The stream state, that is, the array of 624 32-bit integers, is initialized by the procedure described in [MT2002], based on the seed value.

8.4.6.4 Stream Initialization by Function vslNewStreamEx
MT19937 generates the stream and initializes it using the array params[] of n 32-bit unsigned integers:
• If n ≥ 1, perform initialization as described in [MT2002] using the array params[] on input.
• If n = 0, assume params[0] = 1, n = 1 and perform initialization as described in the previous item.

8.4.6.5 Subsequences Selection Methods
vslSkipAheadStream | not supported
vslLeapfrogStream | not supported

8.4.6.6 Generator Period
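For orientation, the single-seed state initialization published with the 2002 reference implementation of MT19937 can be sketched in C as follows. This assumes that [MT2002] denotes that reference code; the text above says only that the VSL procedure is "essentially the same", so the sketch is not guaranteed to match the library bit for bit. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MT_STATE_WORDS 624  /* state size in 32-bit words, per the text */

/* Fills the 624-word state from one 32-bit seed using the recurrence
   of the 2002 reference code:
     state[i] = 1812433253 * (state[i-1] ^ (state[i-1] >> 30)) + i
   with all arithmetic taken modulo 2^32. */
void mt_init_genrand(uint32_t state[MT_STATE_WORDS], uint32_t seed)
{
    assert(state != NULL);
    state[0] = seed;
    for (uint32_t i = 1; i < MT_STATE_WORDS; ++i) {
        uint32_t prev = state[i - 1];
        state[i] = UINT32_C(1812433253) * (prev ^ (prev >> 30)) + i;
    }
}
```

The array-based initialization used by vslNewStreamEx builds on the same idea but mixes in every element of params[]; it is not shown here.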
8.4.6.7 Empirical Testing Results Summary
Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (10% errors)
Bitstream Test | N/A | N/A | N/A | OK (10% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of bits) | N/A | N/A | N/A | OK (20% errors)
Counts-the-1's Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (30% errors) | OK (30% errors) | OK (30% errors) | OK (30% errors)
Parking Lot Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
2D Self-Avoiding Random Walk Test | OK (0% errors) | OK (10% errors) | N/A | OK (10% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.7 SFMT19937
This is a SIMD-oriented Fast Mersenne Twister pseudorandom number generator. The members of its recurrence are 128-bit integers, and the operations used in the recurrence are defined as follows:
• left shift of a 128-bit integer followed by an exclusive-or operation;
• right shift of each 32-bit integer in the quadruple followed by an and-operation with a quadruple of 32-bit masks, mask = (0xBFFFFFF6 0xBFFAFFFF 0xDDFECB7F 0xDFFFFFEF);
• right shift of a 128-bit integer;
• left shift of each 32-bit integer in the quadruple;
• selection of the k-th 32-bit integer in the quadruple.
8.4.7.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of the floating-point values

8.4.7.2 Integer Implementation
The output vector of 32-bit integers , k-th 32-bit integer member of the quadruple .
Thus, the output vector stores every quadruple (128-bit integer) of the integer output as four 32-bit integers. For example, to get a vector of n quadruples, the output array should be large enough to store 4n 32-bit numbers.

8.4.7.3 Stream Initialization by Function vslNewStream
SFMT19937 generates the stream and initializes it using the input 32-bit unsigned integer parameter seed. The stream state, that is, the array of 156 128-bit integers (624 32-bit integers), is initialized by the procedure described in [Saito08], based on the seed value.

8.4.7.4 Stream Initialization by Function vslNewStreamEx
SFMT19937 generates the stream and initializes it using the array params[] of n 32-bit unsigned integers:
• If n ≥ 1, perform initialization as described in [Saito08] using the array params[] on input.
• If n = 0, assume params[0] = 1, n = 1 and perform initialization as described in the previous item.

8.4.7.5 Subsequences Selection Methods
vslSkipAheadStream | not supported
vslLeapfrogStream | not supported

8.4.7.6 Generator Period
8.4.7.7 Empirical Testing Results Summary
Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (30% errors) | OK (30% errors) | N/A | OK (40% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (10% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of bits) | N/A | N/A | N/A | OK (10% errors)
Counts-the-1's Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (10% errors)
Parking Lot Test | OK (30% errors) | OK (30% errors) | N/A | OK (0% errors)
2D Self-Avoiding Random Walk Test | OK (0% errors) | OK (20% errors) | N/A | OK (10% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.8 MT2203
This is a set of 6024 Mersenne Twister pseudorandom number generators (j = 1, ..., 6024).
Matrix (32x32) has the following format: , with the 32-bit vector .
The set of 6024 basic pseudorandom number generators MT2203 is a natural addition to the MT19937 generator. MT2203 generators are intended for use in large scale Monte Carlo simulations performed on multi-processor computer systems. These generators have a shorter period, but 2^2203 - 1 is large enough to meet the requirements of modern Monte Carlo problems. MT2203 produces up to 6024 independent random number sequences. The parameters have been carefully chosen according to the method described in [Matsum2000].
8.4.8.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of the floating-point values

8.4.8.2 Integer Implementation
The output vector of 32-bit integers

8.4.8.3 Stream Initialization by Function vslNewStream
MT2203 generates the stream and initializes it using the input 32-bit unsigned integer parameter seed. The stream state, that is, the array of 69 32-bit integers, is initialized by the procedure described in [MT2002], based on the seed value.
The MT2203 generator is a set of 6024 basic generators. To select an MT2203 generator, add an offset to the named constant VSL_BRNG_MT2203, for example, VSL_BRNG_MT2203+0, VSL_BRNG_MT2203+1, ... . The following example illustrates the initialization of the 10th (of 6024) MT2203 generator:
    vslNewStream (&stream, VSL_BRNG_MT2203+9, seed);

8.4.8.4 Stream Initialization by Function vslNewStreamEx
MT2203 generates the stream and initializes it using the array params[] of n 32-bit unsigned integers:
• If n ≥ 1, perform initialization as described in [MT2002] using the array params[] on input.
• If n = 0, assume params[0] = 1, n = 1 and perform initialization as described in the previous item.

8.4.8.5 Subsequences Selection Methods
vslSkipAheadStream | not supported
vslLeapfrogStream | not supported

8.4.8.6 Generator Period
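The offset-based selection described in section 8.4.8.3 can be wrapped in a small range-checked helper, sketched in C below. The helper name and the macro MT2203_COUNT are hypothetical and not part of VSL; the parameter base stands for the value of the named constant VSL_BRNG_MT2203, which the sketch takes as an argument so that it compiles without the library headers.

```c
#include <assert.h>

#define MT2203_COUNT 6024  /* number of generators in the set, per the text */

/* Returns the identifier of the MT2203 generator with the given
   zero-based index, that is, base + index, or -1 if the index is
   outside 0 ... 6023. */
int mt2203_brng_id(int base, int index)
{
    assert(base >= 0);
    if (index < 0 || index >= MT2203_COUNT)
        return -1;
    return base + index;
}
```

In a multi-threaded Monte Carlo code, each worker k could then request its own generator with a call such as vslNewStream(&stream, mt2203_brng_id(VSL_BRNG_MT2203, k), seed), after checking that the helper did not return -1.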
8.4.8.7 Empirical Testing Results Summary
Test Name | vsRngUniform | vdRngUniform | viRngUniform | viRngUniformBits
3D Spheres Test | OK (20% errors) | OK (20% errors) | N/A | OK (20% errors)
Birthday Spacing Test | N/A | N/A | N/A | OK (0% errors)
Bitstream Test | N/A | N/A | N/A | OK (15% errors)
Rank of 31x31 Binary Matrices Test | N/A | N/A | N/A | OK (10% errors)
Rank of 32x32 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Rank of 6x8 Binary Matrices Test | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of bits) | N/A | N/A | N/A | OK (0% errors)
Counts-the-1's Test (stream of specific bytes) | N/A | N/A | N/A | OK (0% errors)
Craps Test | OK (20% errors) | OK (20% errors) | OK (20% errors) | OK (20% errors)
Parking Lot Test | OK (0% errors) | OK (0% errors) | N/A | OK (0% errors)
2D Self-Avoiding Random Walk Test | OK (10% errors) | OK (0% errors) | N/A | OK (0% errors)

Note:
• N/A means that the test is not applicable to this function.
• The tabulated data is obtained using the one-level (threshold) testing technique. The OK result indicates FAIL < 50%, that is, FAILs occur in fewer than 5 runs out of 10. A run fails when its p-value falls outside the interval [0.05, 0.95].
• The stream tested is generated by calling the function vslNewStream with seed=7,777,777.

8.4.9 SOBOL
This is a 32-bit Gray code-based quasi-random number generator.
Note: The value c is the rightmost zero bit in n-1; is an s-dimensional vector of 32-bit values. The s-dimensional vectors (calculated during random stream initialization) are called direction numbers. The vector is the generator output normalized to the unit hypercube .
Bratley and Fox [Brat87] provide an implementation of the SOBOL quasi-random number generator. The VSL implementation allows generating SOBOL low-discrepancy sequences of length up to 2^32.
This implementation also supports registration of user-defined parameters (direction numbers and primitive polynomials) during initialization, which allows obtaining quasi-random vectors of any dimension. If you do not supply user-defined parameters, the default values are used for generation of quasi-random vectors. The default dimension of quasi-random vectors can vary from 1 to 40 inclusive.

8.4.9.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of floating-point values in which the first s elements are the components of the first s-dimensional quasi-random vector, the next s elements are the components of the second vector, and so on.

8.4.9.2 Integer Implementation
The output vector is the sequence of 32-bit integers in which the first s elements are the components of the first s-dimensional integer vector, the next s elements are the components of the second vector, and so on.

8.4.9.3 Stream Initialization by Function vslNewStream
SOBOL generates the stream and initializes it using the input 32-bit parameter seed (dimension dimen of a quasi-random vector):
• Assume dimen = seed.
• If dimen < 1 or dimen > 40, assume dimen = 1.

8.4.9.4 Stream Initialization by Function vslNewStreamEx
SOBOL generates the stream and initializes it using the array params[] of n 32-bit integers to set the dimension dimen of a quasi-random vector as well as to pass other generator-related parameters, for example, initial direction numbers and primitive polynomials. Direction numbers can also be passed using the array. The general interface for passing stream initialization parameters of SOBOL via the params[] array has the following format:

Position in params[]              Content
0                                 dimen
1                                 Parameter Class Indicators
2                                 Initial Values Subclass Indicators
3...2+dimen                       Primitive polynomials
3+dimen                           Maximum degree of primitive polynomial, maxdeg
4+dimen...dimen*(maxdeg+1)+3      Initial direction numbers

The dimension parameter params[0] is obligatory, and can be initialized as follows:
    params[0] = dimen;
The other elements of params intended for passing additional user-supplied data are optional.
For example, if they are not present, the default tables of direction numbers are used for generation of quasi-random vectors. The VSL default tables of direction numbers allow generating quasi-random sequences for dimensions up to 40. If you want to generate quasi-random vectors of greater dimension or obtain another sequence, you may register a set of your own primitive polynomials and/or a table of initial direction numbers. To do this, set the Parameter Class Indicators field (params[1]) to VSL_USER_QRNG_INITIAL_VALUES:
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
Further, specify in the Initial Values Subclass Indicators field (params[2]) whether you want to supply primitive polynomials, initial direction numbers, or both, by setting the corresponding indicators. In the example below both the direction numbers and the primitive polynomials indicators are set:
    params[2] = VSL_USER_INIT_DIRECTION_NUMBERS | VSL_USER_PRIMITIVE_POLYMS;
If you want to provide just initial direction numbers, do it as follows:
    params[2] = VSL_USER_INIT_DIRECTION_NUMBERS;
Similarly, you can indicate that only primitive polynomials are passed to the library:
    params[2] = VSL_USER_PRIMITIVE_POLYMS;
Note: For dimensions greater than 40, both the primitive polynomials and the table of initial direction numbers must be provided.
The remainder of the params array is used to pass primitive polynomials and/or initial direction numbers. Primitive polynomials are packed as unsigned integers. Initial direction numbers for SOBOL are assumed to form a two-dimensional table in which the i-th row corresponds to the i-th dimension and the number of columns equals the maximum degree of the primitive polynomials, maxdeg. The number of polynomials (and the number of rows in the table) depends on the initialization mode for the first dimension.
In the default initialization mode (see [Brat88] for details) it is enough to pass dimen-1 primitive polynomials into the library (correspondingly, the number of rows in the table of initial direction numbers also equals dimen-1). To override the default initialization for the first dimension, set the VSL_QRNG_OVERRIDE_1ST_DIM_INIT indicator in params[2]:
    params[2] = params[2] | VSL_QRNG_OVERRIDE_1ST_DIM_INIT;
and pass a complete set of polynomials and/or initial direction numbers (dimen primitive polynomials and the table of initial direction numbers with dimen rows). If you pass only primitive polynomials or only initial direction numbers (which, according to the note above, is possible for dimensions up to 40), the default initialization for the first dimension is always assumed (the number of polynomials and the number of rows in the table of initial direction numbers equals dimen-1). If both arrays are passed to the generator, organize the data in the following order: first the polynomials, then the maximum degree of the primitive polynomials and, finally, the initial direction numbers, as in the example below:
    unsigned int uSobolIrredPoly[dimen] = {...};
    unsigned int uSobolMInit[dimen][maxdeg] = {...};
    ...
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_INIT_DIRECTION_NUMBERS|VSL_USER_PRIMITIVE_POLYMS;
    params[2] = params[2] | VSL_QRNG_OVERRIDE_1ST_DIM_INIT;
    for ( i = 0; i < dimen; i++ ) params[i+3] = uSobolIrredPoly[i];
    params[3+dimen] = maxdeg;
    k = 4+dimen;
    for ( i = 0; i < dimen; i++ )
    {
        for ( j = 0; j < maxdeg; j++ )
        {
            params[k++] = uSobolMInit[i][j];
        }
    }
Replacement of the default initial values for SOBOL with user-provided values can be done as shown in the example below:
    ...
    // dimen = 10
    unsigned int uSobolMInit[dimen-1][maxdeg] = {...};
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_INIT_DIRECTION_NUMBERS;
    params[3] = maxdeg;
    k = 4;
    for ( i = 0; i < dimen-1; i++ )
    {
        for ( j = 0; j < maxdeg; j++ )
        {
            params[k++] = uSobolMInit[i][j];
        }
    }
You can also calculate a table of direction numbers using your own initial direction numbers and primitive polynomials and pass this array to the generator. The interface for registration of the direction numbers is as follows:

Position in params[]    Content
0                       dimen
1                       Parameter Class Indicators
2                       Initial Values Subclass Indicators
3...dimen*32+2          Direction numbers

As earlier, the dimension parameter params[0] and the Parameter Class Indicators field (params[1]) can be initialized as follows:
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
Further, initialize the Initial Values Subclass Indicators field (params[2]):
    params[2] = VSL_USER_DIRECTION_NUMBERS;
Direction numbers are assumed to be a dimen x 32 table of unsigned integers and can be passed to the generator in the following way:
    unsigned int uSobolV[dimen][32] = {...};
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_DIRECTION_NUMBERS;
    k = 3;
    for ( i = 0; i < dimen; i++ )
    {
        for ( j = 0; j < 32; j++ )
        {
            params[k++] = uSobolV[i][j];
        }
    }
In short, the SOBOL stream initialization is as follows:
• If n = 0, assume dimen = 1.
• If n = 1, dimen = params[0]. If dimen < 1 or dimen > 40, assume dimen = 1.
• If n > 1, initialize the SOBOL quasi-random stream by means of user-defined primitive polynomials and initial direction numbers, or user-defined direction numbers. If externally defined parameters of the generator are packed incorrectly, initialize the stream using default tables of direction numbers.
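The packing order described above (dimension, class indicator, subclass indicators, polynomials, maxdeg, initial direction numbers) can be sketched as a small helper. This is only an illustration under stated assumptions: the DEMO_* macros are placeholder values introduced here so that the sketch compiles without the library, and the function names are hypothetical. Real code should include mkl_vsl.h and use the VSL_* constants named in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder indicator values (assumption, for illustration only).
   In real code use the VSL_* constants from mkl_vsl.h instead. */
#define DEMO_USER_QRNG_INITIAL_VALUES    0x1u
#define DEMO_USER_PRIMITIVE_POLYMS       0x2u
#define DEMO_USER_INIT_DIRECTION_NUMBERS 0x4u
#define DEMO_QRNG_OVERRIDE_1ST_DIM_INIT  0x10u

/* Number of params[] entries when dimen polynomials, maxdeg, and a
   dimen x maxdeg table are all passed. The last index in the layout
   table is dimen*(maxdeg+1)+3, so the length is one more than that. */
size_t sobol_full_params_len(unsigned dimen, unsigned maxdeg)
{
    return (size_t)dimen * (maxdeg + 1u) + 4u;
}

/* Pack polynomials and a row-major dimen x maxdeg table of initial
   direction numbers, overriding the first-dimension initialization.
   Returns a malloc'ed array (caller frees) or NULL on failure. */
unsigned *sobol_pack_full_params(unsigned dimen, unsigned maxdeg,
                                 const unsigned *polys,
                                 const unsigned *minit,
                                 size_t *n_out)
{
    assert(dimen >= 1 && maxdeg >= 1 && polys && minit && n_out);

    size_t n = sobol_full_params_len(dimen, maxdeg);
    unsigned *params = malloc(n * sizeof *params);
    if (params == NULL)
        return NULL;

    size_t k = 0;
    params[k++] = dimen;                              /* position 0 */
    params[k++] = DEMO_USER_QRNG_INITIAL_VALUES;      /* position 1 */
    params[k++] = DEMO_USER_INIT_DIRECTION_NUMBERS    /* position 2 */
                | DEMO_USER_PRIMITIVE_POLYMS
                | DEMO_QRNG_OVERRIDE_1ST_DIM_INIT;
    for (unsigned i = 0; i < dimen; i++)              /* 3...2+dimen */
        params[k++] = polys[i];
    params[k++] = maxdeg;                             /* 3+dimen */
    for (size_t i = 0; i < (size_t)dimen * maxdeg; i++)
        params[k++] = minit[i];                       /* 4+dimen... */

    assert(k == n);   /* the packed length matches the layout table */
    *n_out = n;
    return params;
}
```

With the library available, the resulting array and its length n would then be handed to vslNewStreamEx together with the SOBOL generator identifier. Computing the length from the layout table up front makes it easier to catch an incorrectly packed array, which the library would otherwise silently replace with the default tables.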
8.4.9.5 Subsequences Selection Methods
vslSkipAheadStream    supported
vslLeapfrogStream     supported
Note:
• The skip-ahead method skips individual components of quasi-random vectors rather than whole s-dimensional vectors. Hence, to skip N s-dimensional quasi-random vectors, call the vslSkipAheadStream subroutine with the parameter nskip equal to N×s.
• The leapfrog method works with individual components of quasi-random vectors rather than with s-dimensional vectors. In addition, its functionality allows picking out a fixed quasi-random component only. In other words, the nstreams parameter should be equal to the predefined constant VSL_QRNG_LEAPFROG_COMPONENTS, and the k parameter should indicate the index of the component of s-dimensional quasi-random vectors to be picked out ($0 \le k < s$).

8.4.9.6 Generator Period
The period is $2^{32}$.

8.4.9.7 Dimensions
$1 \le s \le 40$ is the default set of dimensions; user-defined dimensions are available.

8.4.10 NIEDERREITER
This is a 32-bit Gray code-based quasi-random number generator:
$x_n = x_{n-1} \oplus v_c$
$u_n = x_n / 2^{32}$
Note: The value c is the position of the rightmost zero bit in n-1; $x_n$ is an s-dimensional vector of 32-bit values. The s-dimensional vectors $v_i$, $i = 1, \ldots, 32$ (calculated during random stream initialization) are called direction numbers. The vector $u_n$ is the generator output normalized to the unit hypercube $(0, 1)^s$.
According to the results of Bratley, Fox, and Niederreiter [Brat92], Niederreiter sequences have the best known theoretical asymptotic properties. The VSL implementation allows generating Niederreiter low-discrepancy sequences of length up to $2^{32}$. This implementation also supports registration of user-defined parameters (irreducible polynomials or direction numbers), which allows obtaining quasi-random vectors of any dimension. If you do not supply user-defined parameters, the default values are used for generation of quasi-random vectors. The default dimension of quasi-random vectors can vary from 1 to 318 inclusive.
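The Gray code update used by both quasi-random generators (XOR the previous state with the direction number selected by the rightmost zero bit of n-1, then scale by 2^32) can be sketched in one dimension. The direction numbers below, v_c = 2^(32-c), are an assumption chosen for illustration because they need no tables; they are not the Niederreiter or SOBOL tables used by VSL, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based position of the rightmost zero bit of k. */
unsigned rightmost_zero_bit(uint32_t k)
{
    unsigned c = 1;
    while (k & 1u) {
        k >>= 1;
        c++;
    }
    return c;
}

/* State of a one-dimensional Gray code generator (illustrative). */
typedef struct {
    uint32_t x;   /* current 32-bit state x_{n-1} */
    uint32_t n;   /* number of points produced so far, i.e. n-1 */
} gray_qrng_1d;

void gray_qrng_init(gray_qrng_1d *g)
{
    g->x = 0;
    g->n = 0;
}

/* One step: x_n = x_{n-1} XOR v_c, u_n = x_n / 2^32. */
double gray_qrng_next(gray_qrng_1d *g)
{
    unsigned c = rightmost_zero_bit(g->n);
    assert(c <= 32);                          /* sequence length is at most 2^32 */
    uint32_t v = (uint32_t)1 << (32u - c);    /* illustrative direction number */
    g->x ^= v;
    g->n++;
    return (double)g->x / 4294967296.0;       /* divide by 2^32 */
}
```

Starting from a freshly initialized state, successive calls produce 0.5, 0.75, 0.25, 0.375, which is the base-2 van der Corput sequence visited in Gray code order. Each step costs a single XOR regardless of n, which is the point of the Gray code formulation.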
8.4.10.1 Real Implementation (Single and Double Precision)
The output vector is the sequence of floating-point values in which the first s elements are the components of the first s-dimensional quasi-random vector, the next s elements are the components of the second vector, and so on.

8.4.10.2 Integer Implementation
The output vector is the sequence of 32-bit integers in which the first s elements are the components of the first s-dimensional integer vector, the next s elements are the components of the second vector, and so on.

8.4.10.3 Stream Initialization by Function vslNewStream
NIEDERREITER generates the stream and initializes it using the input 32-bit parameter seed (dimension dimen of a quasi-random vector):
• Assume dimen = seed.
• If dimen < 1 or dimen > 318, assume dimen = 1.

8.4.10.4 Stream Initialization by Function vslNewStreamEx
NIEDERREITER generates the stream and initializes it using the array params[] of n 32-bit integers to set the dimension dimen of a quasi-random vector as well as to pass other generator-related parameters, for example, irreducible polynomials or direction numbers (matrix of the generator). The general interface for passing the polynomials via the params[] array has the following format:

Position in params[]    Content
0                       dimen
1                       Parameter Class Indicators
2                       Initial Values Subclass Indicators
3...2+dimen             Irreducible polynomials

The dimension parameter params[0] is obligatory, and can be initialized as follows:
    params[0] = dimen;
The other elements of params intended for passing additional user-supplied data are optional. For example, if they are not present, the default table of irreducible polynomials is used for generation of quasi-random vectors. The VSL default tables of the polynomials allow generating quasi-random sequences for dimensions up to 318. If you want to generate quasi-random vectors of greater dimension or obtain another sequence, you may register a set of your own irreducible polynomials.
To do this, set the Parameter Class Indicators field (params[1]) to VSL_USER_QRNG_INITIAL_VALUES:
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
Further, indicate in the Initial Values Subclass Indicators field (params[2]) that you want to supply irreducible polynomials:
    params[2] = VSL_USER_IRRED_POLYMS;
The remainder of the params array is used to pass irreducible polynomials. They are packed as unsigned integers and written sequentially into the corresponding positions of the params array, as shown in the example below (the number of polynomials equals the dimension dimen):
    unsigned int uNiederrIrredPoly[dimen] = {...};
    ...
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_IRRED_POLYMS;
    for ( i = 0; i < dimen; i++ ) params[i+3] = uNiederrIrredPoly[i];
You can also calculate direction numbers (matrix of the generator) using your own irreducible polynomials and pass this table to the generator. The interface for registration of the direction numbers is as follows:

Position in params[]    Content
0                       dimen
1                       Parameter Class Indicators
2                       Initial Values Subclass Indicators
3...dimen*32+2          Direction numbers

As earlier, the dimension parameter params[0] and the Parameter Class Indicators field (params[1]) can be initialized as follows:
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
Further, initialize the Initial Values Subclass Indicators field (params[2]):
    params[2] = VSL_USER_DIRECTION_NUMBERS;
Direction numbers are assumed to be a dimen x 32 table of unsigned integers and can be passed to the generator in the following way:
    unsigned int uNiederrCJ[dimen][32] = {...};
    params[0] = dimen;
    params[1] = VSL_USER_QRNG_INITIAL_VALUES;
    params[2] = VSL_USER_DIRECTION_NUMBERS;
    k = 3;
    for ( i = 0; i < dimen; i++ )
    {
        for ( j = 0; j < 32; j++ )
        {
            params[k++] = uNiederrCJ[i][j];
        }
    }
In short, the NIEDERREITER stream initialization is as follows:
• If n = 0, assume dimen = 1.
• If n = 1, dimen = params[0]. If dimen < 1 or dimen > 318, assume dimen = 1.
• If n > 1, initialize the NIEDERREITER quasi-random stream by means of user-defined polynomials. If externally defined parameters of the generator are packed incorrectly, initialize the stream by setting the dimension to 1 and using default tables of irreducible polynomials.

8.4.10.5 Subsequences Selection Methods
vslSkipAheadStream    supported
vslLeapfrogStream     supported
Note:
• The skip-ahead method skips individual components of quasi-random vectors rather than whole s-dimensional vectors. Hence, to skip N s-dimensional quasi-random vectors, call the vslSkipAheadStream subroutine with the parameter nskip equal to N×s.
• The leapfrog method works with individual components of quasi-random vectors rather than with s-dimensional vectors. In addition, its functionality allows picking out a fixed quasi-random component only. In other words, the nstreams parameter should be equal to the predefined constant VSL_QRNG_LEAPFROG_COMPONENTS, and the k parameter should indicate the index of the component of s-dimensional quasi-random vectors to be picked out ($0 \le k < s$).

8.4.10.6 Generator Period
The period is $2^{32}$.

8.4.10.7 Dimensions
$1 \le s \le 318$ is the default set of dimensions; user-defined dimensions are available.

9 Testing of Distribution Random Number Generators

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors.
Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

VSL generators are tested with a testing suite comprising a set of tests to control the quality of random number sequences of general discrete and continuous distributions.
Random numbers of discrete and continuous distributions are generated by transforming random numbers of uniform distribution. A source of uniformly distributed random numbers is a random stream produced by a basic generator. The quality of random number sequences with a non-uniform distribution greatly depends on the quality of the respective basic generator. Therefore, generators of discrete and continuous distributions are tested for each individual basic generator.
VSL can provide several methods of random number generation for any probability distribution. For example, two methods are implemented for the Poisson distribution: the PTPE acceptance/rejection algorithm and the PoisNorm inverse transformation algorithm, based on transformation of the normal distribution. The generator is tested for each of the implemented methods.
VSL offers two different implementations for each of the continuous distributions:
• single-precision real arithmetic
• double-precision real arithmetic.
The single-precision implementation of a generator is, as a rule, faster than the double-precision one. Moreover, the single-precision implementation is quite sufficient for most applications. VSL offers only one implementation for discrete distributions.
Apart from the above-mentioned factors, the quality of an RNG depends on the distribution parameters. For example, different transformation techniques may be used for different parameters. Therefore, generators are also tested for different parameter sets.

9.1 Interpreting Test Results
Test results for general distribution generators are interpreted almost in the same way as for basic generators.
For reliable results, either one-level (threshold) or two-level testing is performed.

9.2 Description of Distribution Generator Tests
This section describes the available Distribution Generator Tests:
• Confidence Test
• Distribution Moments Test
• Chi-Squared Goodness-of-Fit Test

9.2.1 Confidence Test

9.2.1.1 Test Purpose
The test checks how well each output member corresponds to the valid range of possible values. For example, for an exponential distribution with parameters a and $\beta$, all the output members $x_i$ should lie within the range $x_i \ge a$. A value $x_i < a$ is impossible, that is, the fact that the variate X of the exponential distribution with parameters a and $\beta$ acquires a value less than a is an impossible event (not to be confused with a null event). Any output member lying outside the valid range constitutes an error.
Such a test is necessary because statistical tests (for example, the distribution moments test or the chi-squared test) are unable to detect a small number (compared with the total sample size) of $x_i$ values falling outside the valid range.

9.2.1.2 Interpreting Final Results
The test gives the quantity K of random numbers that lie outside the valid range of values. The test is considered passed if K = 0, and failed otherwise.

9.2.2 Distribution Moments Test

9.2.2.1 Test Purpose
The test verifies that sample moments of a given distribution agree with theoretical moments. The sample mean (first order moment) and the sample variance (central moment of the second order) are considered as stable responses.

9.2.2.2 First Level Test
The generated random number sequence is used to compute the sample mean M and the sample variance D, both of which are asymptotically normally distributed. Based on this asymptotic behavior, the p-values $p^M$ and $p^D$ are found using the values of M and D.

9.2.2.3 Second Level Test
The first level test is run 10 times, each run producing a pair of p-values $p_j^M$ and $p_j^D$, j = 1, 2, ... , 10.
The Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling’s statistic is applied to the obtained p-values $p_j^M$, j = 1, 2, ... , 10. If the resulting p-value $p^M < 0.05$ or $p^M > 0.95$, the test is considered failed for the sample mean. The same procedure is performed for the p-values $p_j^D$, j = 1, 2, ... , 10, and if the p-value $p^D < 0.05$ or $p^D > 0.95$, the test is considered failed for the sample variance.

9.2.2.4 Interpreting Final Results
10 runs of the second level test provide the percentage $FAIL_M$ of failed tests for the sample mean and the percentage $FAIL_D$ of failed tests for the sample variance. The final result of the test is the percentage $FAIL = \max(FAIL_M, FAIL_D)$. The value of FAIL < 50% is considered acceptable.

9.2.3 Chi-Squared Goodness-of-Fit Test

9.2.3.1 Test Purpose
The test verifies that the sample distribution function agrees with the hypothesized distribution. The chi-squared statistic V, whose number of degrees of freedom is one less than the number of partition intervals, is considered a stable response.

9.2.3.2 First Level Test
For a given parameter set and a given sample size, the test partitions the distribution domain into disjoint intervals so that the expected number of random values falling into each interval is of order 100. The test counts the actual number of random values within each interval of the generated sample and then calculates the chi-squared statistic V. Since V asymptotically follows the chi-squared distribution $F_{k-1}(x)$ with k - 1 degrees of freedom, where k is the number of intervals, the p-value, which is equal to $F_{k-1}(V)$, should have a distribution that is close to uniform.

9.2.3.3 Second Level Test
The first level test is run 10 times, each run producing a p-value $p_j$, j = 1, 2, ... , 10. The Kolmogorov-Smirnov goodness-of-fit test with Anderson-Darling’s statistic is applied to the obtained p-values $p_j$, j = 1, 2, ... , 10.
If the resulting p-value $p < 0.05$ or $p > 0.95$, the test is considered failed.

9.2.3.4 Interpreting Final Results
The second level test is run 10 times. The final result of the test is the percentage FAIL of failed second level tests. The value of FAIL < 50% is considered acceptable.

9.2.4 Performance
The following factors influence the performance of an RNG of a given distribution:
• architecture and configuration of the hardware and software
• performance of the underlying BRNG
• method of transformation
• number of random numbers to be generated (size of the output vector)
• parameters of a given probability distribution.
VSL random number generators are optimized for Intel(R) Xeon(R) Processor X7560 and Intel(R) Xeon(R) Processor X5670. For more details on performance, see the Vector Statistical Library (VSL) Performance Data document available at http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/. For earlier Intel processors VSL generators are fully functional, yet not specifically optimized.
The value of CPE (Clocks Per Element), which is independent of the processor clock rate, is selected as the unit of measurement. For example, if the generator performance is equal to 10 CPE and the processor clock rate is 1 GHz, then the generator produces $10^9 / 10 = 10^8$ random numbers per second.
The VSL BRNGs differ from each other in speed, therefore data on the performance of general (discrete and continuous) distribution generators is given separately for each BRNG used as an underlying generator to produce uniformly distributed random numbers.
Performance of a general distribution generator also depends on the method chosen for transforming a uniform distribution to a given non-uniform one. This requires specifying the applied transformation method as well.
The length of a generated vector is another factor influencing the performance of the VSL vector type generators.
Calling generators on short vector lengths may prove highly inefficient. See the figure for the typical interdependence between the generator performance and the vector length.
Finally, the generator performance may vary according to the probability distribution parameters. The tables provide performance data only for fixed parameter values (or fixed intervals of parameter variations). Table footnotes contain the parameters with which a given performance is obtained. For some transformation methods the performance is approximately the same over a wide range of parameters; such methods are called uniformly fast. For others the performance may vary considerably with variation in the distribution parameters, for example, in the PTPE method for an RNG of the Poisson distribution. In the latter case, graphs of the interdependence between the performance and the distribution parameters are provided.

9.3 Continuous Distribution Functions
This section describes VSL Continuous Distribution Functions:
• Uniform (VSL_RNG_METHOD_UNIFORM_STD/VSL_RNG_METHOD_UNIFORM_STD_ACCURATE)
• Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER)
• Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2)
• Gaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF)
• GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER)
• GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER2)
• GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_ICDF)
• Exponential (VSL_RNG_METHOD_EXPONENTIAL_ICDF/VSL_RNG_METHOD_EXPONENTIAL_ICDF_ACCURATE)
• Laplace (VSL_RNG_METHOD_LAPLACE_ICDF)
• Weibull (VSL_RNG_METHOD_WEIBULL_ICDF/VSL_RNG_METHOD_WEIBULL_ICDF_ACCURATE)
• Cauchy (VSL_RNG_METHOD_CAUCHY_ICDF)
• Rayleigh (VSL_RNG_METHOD_RAYLEIGH_ICDF/VSL_RNG_METHOD_RAYLEIGH_ICDF_ACCURATE)
• Lognormal (VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2/VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2_ACCURATE)
• Gumbel (VSL_RNG_METHOD_GUMBEL_ICDF)
• Gamma (VSL_RNG_METHOD_GAMMA_GNORM/VSL_RNG_METHOD_GAMMA_GNORM_ACCURATE)
• Beta (VSL_RNG_METHOD_BETA_CJA/VSL_RNG_METHOD_BETA_CJA_ACCURATE)

9.3.1 Uniform (VSL_RNG_METHOD_UNIFORM_STD/VSL_RNG_METHOD_UNIFORM_STD_ACCURATE)
Random number generator of the uniform distribution over the real interval [a,b]. You may identify the underlying BRNG by passing the random stream descriptor stream as a parameter. The Uniform function then calls the real implementation (single precision for vsRngUniform and double precision for vdRngUniform) of this basic generator.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.2 Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER)
Random number generator of the normal (Gaussian) distribution with the parameters a and $\sigma$. You may obtain any successive random number x of the standard normal distribution according to the formula (for details, see [Box58])
$x = \sqrt{-2 \ln u_1}\, \sin(2 \pi u_2)$,
where $u_1$, $u_2$ are a pair of successive random numbers uniformly distributed over the interval (0, 1).
A random number y of the normal distribution with the parameters a and $\sigma$ is obtained from x by scaling and shifting: $y = \sigma x + a$.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.3 Gaussian (VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2)
Random number generator of the normal (Gaussian) distribution with the parameters a and $\sigma$. You may produce a successive pair of random numbers $x_1$, $x_2$ of the standard normal distribution according to the formula (for details, see [Box58])
$x_1 = \sqrt{-2 \ln u_1}\, \sin(2 \pi u_2)$
$x_2 = \sqrt{-2 \ln u_1}\, \cos(2 \pi u_2)$
where $u_1$, $u_2$ are a pair of successive random numbers uniformly distributed over the interval (0, 1).
A random number y of the normal distribution with the parameters a and $\sigma$ is obtained from x by scaling and shifting: $y = \sigma x + a$.
In VSL you can safely call this method even when the random numbers are generated in blocks whose size is not a multiple of 2. Consider the following example. Suppose you use the method VSL_METHOD_DGAUSSIAN_BOXMULLER2 to generate a pair of random numbers of the standard normal distribution.
Option 1.
Single call of the method VSL_METHOD_DGAUSSIAN_BOXMULLER2 with the vector length equal to 2:
    ...
    double x[2];
    ...
    vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, stream, 2, x, 0.0, 1.0);
    ...
In this case, you generate the random numbers x[0], x[1] by the formula
$x[0] = \sqrt{-2 \ln u_1}\, \sin(2 \pi u_2)$
$x[1] = \sqrt{-2 \ln u_1}\, \cos(2 \pi u_2)$
Option 2. Double call of the method VSL_METHOD_DGAUSSIAN_BOXMULLER2 with the vector length equal to 1:
    ...
    double x[2];
    ...
    vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, stream, 1, &x[0], 0.0, 1.0);
    vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, stream, 1, &x[1], 0.0, 1.0);
    ...
At the first call of vdRngGaussian you produce the random number x[0] by the formula
$x[0] = \sqrt{-2 \ln u_1}\, \sin(2 \pi u_2)$
At the second call of vdRngGaussian, the library recognizes that the vector length of the previous call on this stream was odd (equal to 1 in this case). The random number x[1] is therefore generated by the formula
$x[1] = \sqrt{-2 \ln u_1}\, \cos(2 \pi u_2)$
and not by the formula
$x[1] = \sqrt{-2 \ln u_3}\, \sin(2 \pi u_4)$,
as might be supposed.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.4 Gaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF)
Random number generator of the normal (Gaussian) distribution with the parameters a and $\sigma$. You may obtain any successive random number x of the standard normal distribution by the inverse transformation method from the following formula:
$x = \sqrt{2}\, \mathrm{erf}^{-1}(u)$,
where u is a random number uniformly distributed over the interval (-1, 1), and $\mathrm{erf}^{-1}$ is the inverse of the error function
$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt$.
A random number y of the normal distribution with the parameters a and $\sigma$ is obtained from x by scaling and shifting: $y = \sigma x + a$.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.5 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER)
Random number generator of the d-variate (correlated) normal distribution with the parameters a and T.
You may obtain any successive random vector y according to the formula
$y = Tx + a$,
where x is a d-dimensional vector of random numbers from the standard normal distribution, and T is a lower triangular d×d matrix, the Cholesky factor of the variance-covariance matrix. Random numbers from the standard normal distribution are generated by the method VSL_RNG_METHOD_GAUSSIAN_BOXMULLER.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.3.6 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_BOXMULLER2)
Random number generator of the d-variate (correlated) normal distribution with the parameters a and T. You may obtain any successive random vector y according to the formula
$y = Tx + a$,
where x is a d-dimensional vector of random numbers from the standard normal distribution, and T is a lower triangular d×d matrix, the Cholesky factor of the variance-covariance matrix. Random numbers from the standard normal distribution are generated by the method VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.3.7 GaussianMV (VSL_RNG_METHOD_GAUSSIANMV_ICDF)
Random number generator of the d-variate (correlated) normal distribution with the parameters a and T. You may obtain any successive random vector y according to the formula
$y = Tx + a$,
where x is a d-dimensional vector of random numbers from the standard normal distribution, and T is a lower triangular d×d matrix, the Cholesky factor of the variance-covariance matrix. Random numbers from the standard normal distribution are generated by the method VSL_RNG_METHOD_GAUSSIAN_ICDF.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.
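The three GaussianMV methods differ only in how the standard normal vector is produced; the final step, a lower triangular matrix-vector product plus a shift by a, is the same. A minimal sketch of that step follows. It assumes T is held in full row-major storage (the library also supports other storage schemes, which are not modeled here), and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compute y = T*x + a, where T is a d x d lower triangular matrix in
   full row-major storage (entries above the diagonal are ignored),
   x holds d standard normal values, and a is the mean vector. */
void gaussian_mv_transform(size_t d, const double *T, const double *a,
                           const double *x, double *y)
{
    assert(d > 0 && T && a && x && y);
    for (size_t i = 0; i < d; i++) {
        double s = a[i];
        /* Only columns 0..i contribute because T is lower triangular. */
        for (size_t j = 0; j <= i; j++)
            s += T[i * d + j] * x[j];
        y[i] = s;
    }
}
```

For example, with T = [[2, 0], [1, 3]], a = (1, -1) and x = (0.5, 2), the result is y = (2, 5.5). Because T is the Cholesky factor of the variance-covariance matrix, applying it to independent standard normal values yields components with the required correlations.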
9.3.8 Exponential (VSL_RNG_METHOD_EXPONENTIAL_ICDF/VSL_RNG_METHOD_EXPONENTIAL_ICDF_ACCURATE)
Random number generator of the exponential distribution with the parameters a and $\beta$. You may generate any successive random number x of the exponential distribution by the inverse transformation method from the formula:
$x = -\beta \ln u + a$,
where u is a successive random number of a uniform distribution over the interval (0, 1).
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.9 Laplace (VSL_RNG_METHOD_LAPLACE_ICDF)
Random number generator of the Laplace distribution with the parameters a and $\beta$. You may generate any successive random number x of the Laplace distribution by the inverse transformation method from the formula:
$x = -\beta \ln u_1 + a$ if $u_2 \ge 1/2$,
$x = \beta \ln u_1 + a$ if $u_2 < 1/2$,
where $u_1$, $u_2$ is a pair of successive random numbers of a uniform distribution over the interval (0, 1).
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.10 Weibull (VSL_RNG_METHOD_WEIBULL_ICDF/VSL_RNG_METHOD_WEIBULL_ICDF_ACCURATE)
Random number generator of the Weibull distribution with the parameters $\alpha$, a and $\beta$. You may generate any successive random number x of the Weibull distribution by the inverse transformation method from the formula
$x = a + \beta (-\ln u)^{1/\alpha}$,
where u is a successive random number of a uniform distribution over the interval (0, 1).
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.11 Cauchy (VSL_RNG_METHOD_CAUCHY_ICDF)
Random number generator of the Cauchy distribution with the parameters a and $\beta$. You may generate any successive random number x of the Cauchy distribution by the inverse transformation method from the formula
$x = a + \beta \tan u$,
where u is a successive random number of a uniform distribution over the interval $(-\pi/2, \pi/2)$.
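For the Cauchy generator described above, the inverse transformation method can be worked through step by step. The derivation below is a sketch: the distribution function F and the intermediate variable v, uniform over (0, 1), are notation introduced here for illustration and do not appear in the source.

```latex
\begin{align*}
F(x) &= \frac{1}{2} + \frac{1}{\pi}\arctan\!\left(\frac{x - a}{\beta}\right)
  && \text{(Cauchy distribution function)} \\
v = F(x) \;&\Longrightarrow\; x = a + \beta \tan\!\left(\pi\left(v - \tfrac{1}{2}\right)\right),
  && v \sim U(0, 1) \\
u = \pi\left(v - \tfrac{1}{2}\right) \;&\Longrightarrow\; x = a + \beta \tan u,
  && u \sim U\!\left(-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right)
\end{align*}
```

The same pattern (write down F, set it equal to a uniform variate, solve for x) gives the other ICDF formulas in this section.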
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.12 Rayleigh (VSL_RNG_METHOD_RAYLEIGH_ICDF/VSL_RNG_METHOD_RAYLEIGH_ICDF_ACCURATE)

Random number generator of the Rayleigh distribution with the parameters a and β. You may generate any successive random number x of the Rayleigh distribution by the inverse transformation method from the formula $x = a + \beta \sqrt{-\ln u}$, where u is a successive random number of a uniform distribution over the interval (0, 1).

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.13 Lognormal (VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2/VSL_RNG_METHOD_LOGNORMAL_BOXMULLER2_ACCURATE)

Random number generator of the lognormal distribution with the parameters a, σ, b, and β. You may generate any successive random number x of the lognormal distribution by the inverse transformation method from the formula $x = b + \beta e^{y}$, where y is a successive random number of a normal (Gaussian) distribution with the parameters a and σ. The random numbers of the normal distribution are generated using the method VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.14 Gumbel (VSL_RNG_METHOD_GUMBEL_ICDF)

Random number generator of the Gumbel distribution with the parameters a and β. You may generate any successive random number x of the Gumbel distribution by the inverse transformation method from the formula $x = a + \beta \ln y$, where y is a successive random number of an exponential distribution with the parameters a=0 and β=1. The random numbers of the exponential distribution are generated using the method VSL_RNG_METHOD_EXPONENTIAL_ICDF.
See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.15 Gamma (VSL_RNG_METHOD_GAMMA_GNORM/VSL_RNG_METHOD_GAMMA_GNORM_ACCURATE)

Random number generator of the gamma distribution with the parameters shape α, offset a, and scalefactor β. You may generate any successive random number of the standard gamma distribution (a=0, β=1) as follows:
• If α > 1, a gamma distributed random number can be generated as a cube of a properly scaled normal random number [Mars2000]. The algorithm is based on the acceptance/rejection method using the squeeze technique.
• If α < 1, a gamma distributed random number is generated using two acceptance/rejection based algorithms:
  - if α < 0.6, a gamma distributed random number is obtained by transformation of an exponential power distributed random number [Dev86],
  - otherwise, the rejection method from the Weibull distribution is used [Vad77], [Dev86].

Note that when α=1 the gamma distribution reduces to the exponential distribution with the parameters a, β. The random numbers of the exponential distribution are generated using the method VSL_RNG_METHOD_EXPONENTIAL_ICDF.

The gamma distributed random number with the parameters α, a, and β is transformed from the standard gamma random number using scale β and shift a.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.3.16 Beta (VSL_RNG_METHOD_BETA_CJA/VSL_RNG_METHOD_BETA_CJA_ACCURATE)

Random number generator of the beta distribution with two shape parameters p and q, offset a, and scalefactor β.
You may generate any successive random number of the standard beta distribution (a=0, β=1) as follows:
• if min(p, q) > 1, Cheng algorithm is used (for details, see [Cheng78])
• if max(p, q) < 1, composition of two algorithms is applied: if a threshold condition on p and q involving the constants K = 0.852... and C = -0.956... holds, Jöhnk algorithm is used (for details, see [Jöhnk64]); otherwise Atkinson switching algorithm is used (for details, see [Atkin79])
• if min(p, q) < 1 and max(p, q) > 1, the random numbers are generated using the switching algorithm of Atkinson (for details, see [Atkin79])
• if p=1 or q=1, the inverse transformation method is used
• if p=1 and q=1, the standard beta distribution reduces to the uniform distribution over the interval (0,1). The random numbers of the uniform distribution are generated using the VSL_RNG_METHOD_UNIFORM_STD method.

The algorithms of Cheng and Atkinson use the acceptance/rejection technique. The beta distributed random number with the parameters p, q, a, and β is obtained from the standard beta random number $x_0$ as $x = a + \beta x_0$.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4 Discrete Distribution Functions

This section describes VSL Discrete Distribution Functions:
• Uniform (VSL_RNG_METHOD_UNIFORM_STD)
• UniformBits (VSL_RNG_METHOD_UNIFORMBITS_STD)
• UniformBits32 (VSL_RNG_METHOD_UNIFORMBITS32_STD)
• UniformBits64 (VSL_RNG_METHOD_UNIFORMBITS64_STD)
• Bernoulli (VSL_RNG_METHOD_BERNOULLI_ICDF)
• Geometric (VSL_RNG_METHOD_GEOMETRIC_ICDF)
• Binomial (VSL_RNG_METHOD_BINOMIAL_BTPE)
• Hypergeometric (VSL_RNG_METHOD_HYPERGEOMETRIC_H2PE)
• Poisson (VSL_RNG_METHOD_POISSON_PTPE)
• Poisson (VSL_RNG_METHOD_POISSON_POISNORM)
• PoissonV (VSL_RNG_METHOD_POISSONV_POISNORM)
• NegBinomial (VSL_RNG_METHOD_NEGBINOMIAL_NBAR)

9.4.1 Uniform (VSL_RNG_METHOD_UNIFORM_STD)

Uniform discrete distribution over the integer interval [a, b).
You may generate any successive random number k of the uniform distribution by the formula $k = \lfloor u \rfloor$, where u is a successive random number of a uniform (continuous) distribution over the interval [a, b) and $\lfloor x \rfloor$ stands for the operation floor(x), which produces the largest integer that does not exceed x.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4.2 UniformBits (VSL_RNG_METHOD_UNIFORMBITS_STD)

A random number generator of uniform distribution that produces an integer (non-normalized to the interval (0, 1)) sequence. You may identify the underlying BRNG by passing the random stream descriptor stream as a parameter. The UniformBits function then calls the integer implementation of this basic generator.

Basic generators differ in bit capacity and structure of the integer output, so you should interpret the output integer array of the function viRngUniformBits correctly. The following table provides rules for interpreting the 32-bit integer output r[i] for each VSL basic generator.

[Table: for each VSL basic generator (MCG31m1, R250, MRG32k3a, MCG59, WH, MT19937, MT2203, SFMT19937, SOBOL, NIEDERR), the integer recurrence and the interpretation of the 32-bit integer output array r[i] after calling viRngUniformBits. For SOBOL and NIEDERR, s is the dimension of the quasi-random vector.]

Notes:
• Taking the lower 32 bits of a 64-bit unsigned integer x means computing $x \bmod 2^{32}$.
• Taking the upper 32 bits of a 64-bit unsigned integer x means computing $\lfloor x / 2^{32} \rfloor$.
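The two operations named in the notes, taking the lower and the upper 32 bits of a 64-bit unsigned integer, can be sketched in C as follows. The function names are hypothetical, and the order in which split64 stores the two halves is an assumption made for illustration only; consult the table for the layout that applies to each basic generator.

```c
#include <stdint.h>

/* Lower 32 bits of x, that is, x mod 2^32. */
uint32_t lower32(uint64_t x)
{
    return (uint32_t)(x & 0xFFFFFFFFu);
}

/* Upper 32 bits of x, that is, floor(x / 2^32). */
uint32_t upper32(uint64_t x)
{
    return (uint32_t)(x >> 32);
}

/* Stores one 64-bit value as two 32-bit elements.
 * Assumption: the lower half goes first. */
void split64(uint64_t x, uint32_t r[2])
{
    r[0] = lower32(x);
    r[1] = upper32(x);
}
```

Using the fixed-width types from stdint.h keeps the arithmetic independent of the platform's native integer sizes.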
So, when you generate an integer sequence of n elements, the output array r[i] of the function viRngUniformBits comprises:
• n elements for the basic generators MCG31m1, R250, MRG32k3a, MT19937, MT2203, SOBOL, and NIEDERR
• 2n elements for the basic generator MCG59
• 4n elements for the basic generators WH and SFMT19937.

You may use the integer output, in particular, for fast generation of bit vectors. However, in this case some bits (or groups of them) may happen to be non-random. For example, lower bits produced by linear congruential generators are less random than their higher bits. Note that quasi-random numbers are not random at all. Thoroughly check the integer output bits and bit groups for randomness before forming bit vectors from the r[i] array.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4.3 UniformBits32 (VSL_RNG_METHOD_UNIFORMBITS32_STD)

A random number generator that produces uniformly distributed bits in 32-bit chunks. Some basic random number generators produce integers in which not all of the bits are uniformly distributed, for example:
• The least significant bits in the integers produced by the MCG59 BRNG are less random: the lower four bits form a congruential sequence of period at most 16, and the least significant bit is either constant or strictly alternating (see, for example, [Knuth81]).
• By design, BRNGs do not produce the most significant bits, setting them to zero. For example, MCG31m1 is a 31-bit generator, and MCG59 is a 59-bit generator.

The UniformBits32 function transforms the underlying BRNG integer recurrence so that all bits in 32-bit chunks are uniformly distributed.
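The weakness of the low-order bits mentioned above is easy to observe with a toy multiplicative congruential generator modulo 2^59. The sketch below is an illustration only: the multiplier 13^13 is an assumption not stated in this section, the function names are hypothetical, and the code is not the VSL MCG59 implementation.

```c
#include <stdint.h>

/* Assumed illustrative multiplier, 13^13. */
#define MCG_MULT 302875106592253ULL
/* Mask that keeps the lower 59 bits, i.e. reduces modulo 2^59. */
#define MCG_MASK ((1ULL << 59) - 1ULL)

/* One step of x <- (MCG_MULT * x) mod 2^59. The 64-bit product wraps
 * modulo 2^64, and masking then gives the residue modulo 2^59. */
uint64_t mcg_next(uint64_t x)
{
    return (x * MCG_MULT) & MCG_MASK;
}

/* Lowest k bits of x, for 0 < k < 64. */
unsigned low_bits(uint64_t x, unsigned k)
{
    return (unsigned)(x & ((1ULL << k) - 1ULL));
}
```

Starting from an odd seed, the least significant bit of every state stays equal to 1, and the lowest four bits return to their starting value after at most 16 steps, which is why such bits should not be used directly when forming bit vectors.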
The UniformBits32 function does not support the following VSL BRNGs:
• VSL_BRNG_MCG31
• VSL_BRNG_R250
• VSL_BRNG_MRG32K3A
• VSL_BRNG_WH
• VSL_BRNG_SOBOL
• VSL_BRNG_NIEDERR
• VSL_BRNG_IABSTRACT
• VSL_BRNG_DABSTRACT
• VSL_BRNG_SABSTRACT

9.4.4 UniformBits64 (VSL_RNG_METHOD_UNIFORMBITS64_STD)

A random number generator that produces uniformly distributed bits in 64-bit chunks. The generator addresses the same BRNG issues as its 32-bit counterpart, UniformBits32. The UniformBits64 function transforms the underlying BRNG integer recurrence so that all bits in 64-bit chunks are uniformly distributed.

This function does not support the following VSL BRNGs:
• VSL_BRNG_MCG31
• VSL_BRNG_R250
• VSL_BRNG_MRG32K3A
• VSL_BRNG_WH
• VSL_BRNG_SOBOL
• VSL_BRNG_NIEDERR
• VSL_BRNG_IABSTRACT
• VSL_BRNG_DABSTRACT
• VSL_BRNG_SABSTRACT

9.4.5 Bernoulli (VSL_RNG_METHOD_BERNOULLI_ICDF)

Bernoulli distribution with the parameter p. You may generate any successive random number k of the Bernoulli distribution from a successive random number u of a uniform distribution over the interval [0, 1), so that k = 1 with probability p and k = 0 otherwise.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4.6 Geometric (VSL_RNG_METHOD_GEOMETRIC_ICDF)

Geometric distribution with the parameter p. You may generate any successive random number k of the geometric distribution by the formula $k = \lfloor \ln u / \ln(1 - p) \rfloor$, where u is a successive random number of a uniform distribution over the interval [0, 1).

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary.

9.4.7 Binomial (VSL_RNG_METHOD_BINOMIAL_BTPE)

Binomial distribution with the parameters ntrial and p. Depending on the values of ntrial and p, random numbers of the binomial distribution are generated either by the BTPE method (see [Kach88] for details) or by a combination of inverse transformation and table lookup methods.
The BTPE method is a variation of the acceptance/rejection method that uses linear (on the fraction close to the distribution mode) and exponential (at the distribution tails) functions as majorizing functions. To avoid time-consuming acceptance/rejection checks, areas with zero probability of rejection are introduced and the squeezing technique is applied.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.8 Hypergeometric (VSL_RNG_METHOD_HYPERGEOMETRIC_H2PE)

Hypergeometric distribution with the parameters l, s, and m. Depending on the parameter values, the random numbers are generated either by the H2PE method (see [Kach85] for details) or by the inverse transformation method in combination with the table lookup method. The H2PE method is a variation of the acceptance/rejection method that uses constant (on the fraction close to the distribution mode) and exponential (at the distribution tails) functions as majorizing functions. To avoid time-consuming acceptance/rejection checks, the squeezing technique is applied.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.9 Poisson (VSL_RNG_METHOD_POISSON_PTPE)

Poisson distribution with the parameter λ. Depending on the value of λ, random numbers are generated either by the PTPE method (see [Schmeiser81] for details) or by a combination of inverse transformation and table lookup methods. The PTPE method is a variation of the acceptance/rejection method that uses linear (on the fraction close to the distribution mode) and exponential (at the distribution tails) functions as majorizing functions.
To avoid time-consuming acceptance/rejection checks, areas with zero probability of rejection are introduced and the squeezing technique is applied.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.10 Poisson (VSL_RNG_METHOD_POISSON_POISNORM)

Poisson distribution with the parameter λ. Depending on the value of λ, the random numbers are generated either by a combination of inverse transformation and table lookup methods or through transformation of normally distributed random numbers. The VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2 method is used to generate random numbers of normal distribution.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.11 PoissonV (VSL_RNG_METHOD_POISSONV_POISNORM)

Poisson distribution with the parameter λ. Depending on the value of λ, the random numbers are generated either by the inverse transformation method or through transformation of normally distributed random numbers. The VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2 method is used to generate random numbers of normal distribution.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

9.4.12 NegBinomial (VSL_RNG_METHOD_NEGBINOMIAL_NBAR)

Negative binomial distribution with the parameters a and p. Depending on the values of a and p, the random numbers are generated either by the NBAR method or by a combination of inverse transformation and table lookup methods. The NBAR method is a variation of the acceptance/rejection method that uses constant and linear functions (on the fraction close to the distribution mode) and exponential functions (at the distribution tails) as majorizing functions.
To ensure that the majorizing functions are close to the normalized probability mass function, five 2D figures are formed from the majorizing and minorizing functions as well as from other auxiliary curves. To avoid time-consuming acceptance/rejection checks, areas with zero probability of rejection are introduced.

See http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm for test results summary and performance graphs.

Bibliography

[Ant79] Antonov, I.A., and Saleev, V.M. An economic method of computing LPτ-sequences. USSR Comput. Math. Math. Phys., 19, 252-256, 1979.
[Atkin79] Atkinson, A.C. A family of switching algorithms for the computer generation of beta random variables. Biometrika, 66, 1, 141-145, 1979.
[Box58] Box, G.E.P., and Muller, M.E. A Note on the Generation of Random Normal Deviates. Ann. Math. Stat., 29, 610-611, 1958.
[Brat87] Bratley, P., Fox, B.L., and Schrage, L.E. A Guide to Simulation, 2nd edition. Springer-Verlag, New York, 1987.
[Brat88] Bratley, P., and Fox, B.L. ALGORITHM 659: Implementing Sobol's Quasirandom Sequence Generator. ACM Transactions on Mathematical Software, Vol. 14, No. 1, 88-100, March 1988.
[Brat92] Bratley, P., Fox, B.L., and Niederreiter, H. Implementation and Tests of Low-Discrepancy Sequences. ACM Transactions on Modeling and Computer Simulation, Vol. 2, No. 3, 195-213, July 1992.
[Cheng78] Cheng, R.C.H. Generating Beta Variates with Nonintegral Shape Parameters. Communications of the ACM, 21, 4, 317-322, 1978.
[Cram46] Cramer, H. Mathematical Methods of Statistics. Cambridge, 1946.
[Dev86] Devroye, L. Non-Uniform Random Variate Generation. Springer-Verlag, New York, 1986.
[Ent98] Entacher, Karl. Bad Subsequences of Well-Known Linear Congruential Pseudorandom Number Generators. ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, 61-70, January 1998.
[Jöhnk64] Jöhnk, M.D. Erzeugung von Betaverteilten und Gammaverteilten Zufallszahlen. Metrika, 8, 5-15, 1964.
[Jun99] Jun, B., and Kocher, P. The Intel Random Number Generator. White paper prepared for Intel Corp., Cryptography Research, Inc., April 1999.
[Kach88] Kachitvichyanukul, V., and Schmeiser, B.W. Binomial random variate generation. Communications of the ACM, Volume 31, Issue 2, February 1988.
[Kach85] Kachitvichyanukul, V., and Schmeiser, B.W. Computer generation of hypergeometric random variates. J. Stat. Comput. Simul., 22, 1, 127-145, 1985.
[Kirk81] Kirkpatrick, S., and Stoll, E. A Very Fast Shift-Register Sequence Random Number Generator. Journal of Computational Physics, V. 40, 517-526, 1981.
[Knuth81] Knuth, Donald E. The Art of Computer Programming, Volume 2, Seminumerical Algorithms, 2nd edition. Addison-Wesley Publishing Company, Reading, Massachusetts, 1981.
[L'Ecu94] L'Ecuyer, Pierre. Uniform Random Number Generators. Annals of Operations Research, 53, 77-120, 1994.
[L'Ecu99] L'Ecuyer, P. Good Parameter Sets for Combined Multiple Recursive Random Number Generators. Operations Research, 47, 1, 159-164, 1999.
[L'Ecuyer99] L'Ecuyer, Pierre. Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure. Mathematics of Computation, 68, 249-260, 1999.
[MacLaren89] MacLaren, N.M. The Generation of Multiple Independent Sequences of Pseudorandom Numbers. Applied Statistics, 38, 351-359, 1989.
[Mars95] Marsaglia, G. The Marsaglia Random Number CDROM, including the DIEHARD Battery of Tests of Randomness. Department of Statistics, Florida State University, Tallahassee, Florida, 1995.
[Mars2000] Marsaglia, G., and Tsang, W.W. A simple method for generating gamma variables. ACM Transactions on Mathematical Software, Vol. 26, No. 3, 363-372, September 2000.
[Matsum92] Matsumoto, M., and Kurita, Y. Twisted GFSR generators. ACM Transactions on Modeling and Computer Simulation, Vol. 2, No. 3, 179-194, July 1992.
[Matsum94] Matsumoto, M., and Kurita, Y. Twisted GFSR generators II. ACM Transactions on Modeling and Computer Simulation, Vol. 4, No. 3, 254-266, July 1994.
[Matsum98] Matsumoto, M., and Nishimura, T. Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, 3-30, January 1998.
[Matsum2000] Matsumoto, M., and Nishimura, T. Dynamic Creation of Pseudorandom Number Generators, 56-69, in: Monte Carlo and Quasi-Monte Carlo Methods 1998, Ed. Niederreiter, H., and Spanier, J., Springer, 2000, http://www.math.sci.hiroshima-u.ac.jp/%7Em-mat/MT/DC/dc.html.
[Mikh2000] Mikhailov, G.A. Weight Monte Carlo Methods. Novosibirsk: SB RAS Publ., 2000 (in Russian).
[MT2002] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html
[NAG] Numerical Algorithms Group, www.nag.co.uk.
[Ripley87] Ripley, B.D. Stochastic Simulation. Wiley, New York, 1987.
[Saito08] Saito, M., and Matsumoto, M. SIMD-oriented Fast Mersenne Twister: a 128-bit Pseudorandom Number Generator. Monte Carlo and Quasi-Monte Carlo Methods 2006, Springer, pp. 607-622, 2008, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/earticles.html
[Schmeiser81] Schmeiser, Bruce, and Kachitvichyanukul, Voratas. Poisson Random Variate Generation. Research Memorandum 81-4, School of Industrial Engineering, Purdue University, 1981.
[Vad77] Vaduva, I. On computer generation of gamma random variables by rejection and composition procedures. Mathematische Operationsforschung und Statistik, Series Statistics, vol. 8, 545-576, 1977.
[Ziff98] Ziff, Robert M. Four-tap shift-register-sequence random-number generators. Computers in Physics, Vol. 12, No. 4, Jul/Aug 1998.
Intel® C++ Composer XE 2011 Getting Started Tutorials

Document Number: 323648-001US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel® C++ Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
  Starting the Intel® C++ Compiler from the Eclipse* IDE
  Starting the Intel® C++ Compiler from the Command Line
  Starting the Intel® Debugger
Chapter 2: Tutorial: Intel® C++ Compiler
  Using Auto Vectorization
    Introduction to Auto-vectorization
    Establishing a Performance Baseline
    Generating a Vectorization Report
    Improving Performance by Pointer Disambiguation
    Improving Performance by Aligning Data
    Improving Performance with Interprocedural Optimization
    Additional Exercises
  Using Guided Auto-parallelization
    Introduction to Guided Auto-parallelization
    Preparing the Project for Guided Auto-parallelization
    Running Guided Auto-parallelization
    Analyzing Guided Auto-parallelization Reports
    Implementing Guided Auto-parallelization Recommendations

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Portions Copyright © 2001, Hewlett-Packard Development Company, L.P.
Copyright © 2011, Intel Corporation. All rights reserved.

Introducing the Intel® C++ Composer XE 2011

This guide shows you how to start the Intel® C++ Composer XE 2011 and begin debugging code using the Intel® Debugger. The Intel® C++ Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® C++ Compiler
• Intel® Integrated Performance Primitives
• Intel® Threading Building Blocks
• Intel® Math Kernel Library
• Intel® Debugger

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of the Linux* operating system, including how to:
• install the Intel® C++ Composer XE 2011 on a supported Linux distribution. See the Release Notes.
• open a Linux shell and execute fundamental commands including make.
• compile and link C/C++ source files.

Chapter 1: Navigation Quick Start

Starting the Intel® C++ Compiler from the Eclipse* IDE

The Intel® C++ Compiler XE 12.1 for Linux* OS compiles C and C++ source files on Linux* operating systems. The compiler is supported on IA-32 and Intel® 64 architectures.
You must first install and configure Eclipse on your system; then you can configure Eclipse to use the Intel® C++ Compiler XE 12.1. See the Getting Started section in the compiler documentation for current information about compiling applications with Eclipse*. The Using Eclipse* section provides detailed information about configuring and using Eclipse with the Intel® C/C++ Compilers.

Starting the Intel® C++ Compiler from the Command Line

The Intel® C++ Compiler XE 12.1 for Linux* OS compiles C and C++ source files on Linux* operating systems. The compiler is supported on IA-32 and Intel® 64 architectures.

Start using the compiler by performing the following steps:
1. Open a terminal session.
2. Set the environment variables for the compiler.
3. Invoke the compiler.

One way to set the environment variables prior to invoking the compiler is to "source" the compiler environment script, compilervars.sh (or compilervars.csh):

    source <install-dir>/bin/compilervars.sh <arg>

where <install-dir> is the directory structure containing the compiler /bin directory, and <arg> is the architecture argument listed below.

The environment script takes an argument based on architecture. Valid arguments are as follows:
• ia32: Compilers and libraries for IA-32 architectures only
• intel64: Compilers and libraries for Intel® 64 architectures only

To compile C source files, use a command similar to the following:

    icc my_source_file.c

To compile C++ source files, use a command similar to the following:

    icpc my_source_file.cpp

Following successful compilation, an executable is created in the current directory.
Starting the Intel® Debugger

The Intel® Debugger (IDB) is a full-featured symbolic source code application debugger that helps programmers:
• Debug programs
• Disassemble and examine machine code and examine machine register values
• Debug programs with shared libraries
• Debug multithreaded applications

The debugger features include:
• C/C++ language support
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Starting the Intel® Debugger

On Linux*, you can use the Intel Debugger from a Java* GUI application or the command line.
• To start the GUI for the Intel Debugger, execute the idb command from a Linux shell.
• To start the command-line invocation of the Intel Debugger, execute the idbc command from a Linux shell.

Chapter 2: Tutorial: Intel® C++ Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® C++ Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel C++ Compiler at optimization levels of -O2 and higher.
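To make the idea of unrolling concrete, the following sketch writes out by hand the shape a vectorizer gives a simple loop: a main loop that handles four elements per iteration, followed by a remainder loop. It is a conceptual illustration with hypothetical function names; a real vectorizer emits packed SIMD instructions for the main loop rather than four scalar statements.

```c
#include <stddef.h>

/* Straightforward scalar loop: one element per iteration. */
void add_scalar(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Same computation, unrolled by four. A 128-bit SIMD register holds
 * four floats, so each group of four statements corresponds to one
 * packed addition in vectorized code. */
void add_unrolled4(size_t n, const float *a, const float *b, float *c)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    /* Remainder loop for the last n mod 4 elements. */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Both functions produce identical results; the second only makes the grouping of iterations explicit.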
Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by pointer disambiguation
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, locate the source files in the product's Samples directory:

    <install-dir>/Samples/<locale>/C++/vec_samples/

Use these files for this tutorial:
• Driver.c
• Multiply.c
• Multiply.h

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, compile your sources with these compiler options:

    icc -O1 -std=c99 -DNOFUNCCALL Multiply.c Driver.c -o MatVector

Execute MatVector and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

This example uses a variable length array (VLA) and therefore must be compiled with the -std=c99 option.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not. Because vectorization is off at -O1, the compiler does not generate a vectorization report, so recompile at -O2 (default optimization):

    icc -std=c99 -DNOFUNCCALL -vec-report1 Multiply.c Driver.c -o MatVector

Record the new execution time. The reduction in time is mostly due to auto-vectorization of the inner loop at line 150 noted in the vectorization report:

    Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.

The -vec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them.
    icc -std=c99 -DNOFUNCCALL -vec-report2 Multiply.c Driver.c -o MatVector

The vectorization report indicates that the loop at line 45 in Multiply.c did not vectorize because it is not the innermost loop of the loop nest. Two versions of the innermost loop at line 55 were generated, but neither version was vectorized.

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: loop was not vectorized: existence of vector dependence.
    Multiply.c(55) (col. 3): remark: loop skipped: multiversioned.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(148) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

NOTE. For more information on the -vec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Pointer Disambiguation

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe.
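A small example shows how aliasing creates such a dependency. The function below is hypothetical and is not taken from the tutorial sources; it simply writes src[i] + 1 into dst[i]. When the caller passes overlapping pointers, each iteration reads the value stored by the previous one, so the iterations cannot be executed in parallel without changing the result.

```c
#include <stddef.h>

/* Writes src[i] + 1 into dst[i]. Nothing tells the compiler that dst
 * and src are distinct, so it must assume they may overlap. */
void add_one(size_t n, float *dst, const float *src)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}
```

Called with separate arrays, every output element is simply the input plus one. Called as add_one(3, buf + 1, buf) on a zero-filled buffer of four floats, the stores feed later loads and the buffer ends up holding 0, 1, 2, 3. A compiler that cannot rule out the second case has to either keep the loop scalar or guard a vectorized version with a runtime overlap test.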
Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.

NOTE. The restrict qualifier requires the use of either the -restrict compiler option for .c or .cpp files, or the -std=c99 compiler option for .c files.

Replace the NOFUNCCALL macro with NOALIAS.

    icc -std=c99 -vec-report2 -DNOALIAS Multiply.c Driver.c -o MatVector

This conditional compilation replaces the loop in the main program with a function call. Execute MatVector and record the execution time reported in the output.

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now that the compiler has been told that the arrays do not overlap, it knows that it is safe to vectorize the loop.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary so that the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment. Using the ALIGNED macro will modify the declarations of a, b, and x in Driver.c using the __attribute keyword, which has the following syntax:

    float array[30] __attribute((aligned(base, [offset])));

This instructs the compiler to create an array that is aligned on a "base"-byte boundary with an "offset" (Default=0) in bytes from that boundary. Example:

    FTYPE a[ROW][COLWIDTH] __attribute((aligned(16)));

In addition, the row length of the matrix, a, needs to be padded out to be a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in Multiply.c are aligned by using #pragma vector aligned.

NOTE. If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.
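The combination of an aligned declaration and a padded row length can be sketched as follows. This is an illustration under stated assumptions rather than the code in Driver.c: the names `ROWS`, `COLS`, `COLS_PADDED`, and `row_is_aligned` are hypothetical, the row length of 5 is arbitrary, and the sketch uses the `__attribute__((aligned(...)))` spelling accepted by GCC-compatible compilers.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 4
#define COLS 5            /* hypothetical logical row length */
#define ALIGN_BYTES 16

/* Number of doubles that fit in one 16-byte block. */
#define ELEMS_PER_BLOCK (ALIGN_BYTES / sizeof(double))

/* Round the row length up to a whole number of 16-byte blocks. */
#define COLS_PADDED \
    (((COLS + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK) * ELEMS_PER_BLOCK)

/* The array starts on a 16-byte boundary; because each row occupies a
 * multiple of 16 bytes, every row start is 16-byte aligned as well. */
static double a[ROWS][COLS_PADDED] __attribute__((aligned(ALIGN_BYTES)));

/* Returns nonzero if row r begins on a 16-byte boundary. */
int row_is_aligned(int r)
{
    assert(r >= 0 && r < ROWS);
    return ((uintptr_t)&a[r][0] % ALIGN_BYTES) == 0;
}
```

Without the padding, a row of five 8-byte doubles would occupy 40 bytes, so every second row would start 8 bytes past a 16-byte boundary even though the array itself is aligned.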
Recompile the program after adding the ALIGNED macro to ensure consistently aligned data.

    icc -std=c99 -vec-report2 -DNOALIAS -DALIGNED Multiply.c Driver.c -o MatVector

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(140) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(72) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(60) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(61) (col. 4): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.

Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across source line boundaries. These may include, but are not limited to, function inlining. This is enabled with the -ipo option.

Recompile the program using the -ipo option to enable interprocedural optimization.

    icc -std=c99 -vec-report2 -DNOALIAS -DALIGNED -ipo Multiply.c Driver.c -o MatVector

Note that the vectorization messages now appear at the point of inlining in Driver.c (line 155).

    Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(155) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(155) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(60) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by adding the macro, FTYPE=float. The non-vectorized versions of the loop execute only slightly faster than the double precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set COLBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, #pragma vector aligned will cause the program to fail.

This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.

Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® C++ Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the -guide option with your normal compiler options at -O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization.
Using -guide in conjunction with -parallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:

1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the source file archive located at:

    /Samples//C++/guided_auto_parallel.tar.gz

The following files are included:

• Makefile
• main.cpp
• main.h
• scalar_dep.cpp
• scalar_dep.h

Copy these files to a directory on your system where you have write and execute permissions.

Running Guided Auto-parallelization

You can use the -guide option to generate GAP advice. From a directory where you can compile the sample program, execute make gap_vec_report from the command line, or execute:

    icpc -c -guide scalar_dep.cpp

The GAP Report appears in the compiler output. GAP reports are encapsulated with GAP REPORT LOG OPENED and END OF GAP REPORT LOG.

    GAP REPORT LOG OPENED
    remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
    scalar_dep.cpp(51): remark #30515: (VECT) Loop at line 51 cannot be vectorized due to conditional assignment(s) into the following variable(s): b. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
    Number of advice-messages emitted for this compilation session: 1.
    END OF GAP REPORT LOG

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code.
For this sample tutorial, GAP generates output for the loop in scalar_dep.cpp:

    for (i=0; i<n; i++) {
      if (A[i] > 0) {b=A[i]; A[i] = 1 / A[i]; }
      if (A[i] > 1) {A[i] += b;}
    }

In this example, the GAP Report generates a recommendation (remark #30761) to add the -parallel option to improve auto-parallelization. Remark #30515 indicates that if variable b can be unconditionally assigned, the compiler will be able to vectorize the loop.

    GAP REPORT LOG OPENED
    remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
    scalar_dep.cpp(51): remark #30515: (VECT) Loop at line 51 cannot be vectorized due to conditional assignment(s) into the following variable(s): b. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
    Number of advice-messages emitted for this compilation session: 1.
    END OF GAP REPORT LOG

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the -parallel option to enable parallelization. From the command line, execute make gap_par_report, or run the following:

    icpc -c -guide -parallel scalar_dep.cpp

The compiler emits the following:

    GAP REPORT LOG OPENED ON Wed Jul 28 14:33:09 2010
    scalar_dep.cpp(51): remark #30523: (PAR) Loop at line 51 cannot be parallelized due to conditional assignment(s) into the following variable(s): b. This loop will be parallelized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration. [ALTERNATIVE] Another way is to use "#pragma parallel private(b)" to parallelize the loop. [VERIFY] The same conditions described previously must hold.
    scalar_dep.cpp(51): remark #30525: (PAR) If the trip count of the loop at line 51 is greater than 188, then use "#pragma loop count min(188)" to parallelize this loop. [VERIFY] Make sure that the loop has a minimum of 188 iterations.
    Number of advice-messages emitted for this compilation session: 2.
    END OF GAP REPORT LOG

In the GAP Report, remark #30523 indicates that the loop at line 51 cannot be parallelized because the variable b is conditionally assigned. Remark #30525 indicates that the loop trip count must be greater than 188 for the compiler to parallelize the loop.

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program.

For this loop, the conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

    #ifdef TEST_GAP
    #pragma loop count min (188)
      for (i=0; i<n; i++) {
        b = A[i];
        if (A[i] > 0) {A[i] = 1 / A[i];}
        if (A[i] > 1) {A[i] += b;}
      }
    #else
      for (i=0; i<n; i++) {
        if (A[i] > 0) {b=A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) {A[i] += b;}
      }
    #endif
    }

To verify that the loop is parallelized and vectorized:

• Add the compiler options -vec-report1 -par-report1.
• Add the conditional definition TEST_GAP to compile the appropriate code path.

From the command line, execute make final, or run the following:

    icpc -c -parallel -DTEST_GAP -vec-report1 -par-report1 scalar_dep.cpp

The compiler's -vec-report and -par-report options emit the following output, confirming that the program is vectorized and parallelized:

    scalar_dep.cpp(43) (col. 3): remark: LOOP WAS AUTO-PARALLELIZED.
    scalar_dep.cpp(43) (col. 3): remark: LOOP WAS VECTORIZED.
    scalar_dep.cpp(43) (col. 3): remark: LOOP WAS VECTORIZED.

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.
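The [VERIFY] step in the GAP advice asks you to confirm that b is only ever read in an iteration that has already written it. One way to gain confidence is to run both forms of the loop on the same data and compare the results. The sketch below does this in plain C; the function names `recip_conditional` and `recip_unconditional` are hypothetical, and the loop bodies are modeled on the loop discussed above rather than copied from scalar_dep.cpp. The Intel-specific pragma is left out so that the sketch builds with any C99 compiler.

```c
#include <assert.h>

/* Conditional form: b is assigned only when A[i] > 0, so the compiler
 * must assume a value of b may carry over from an earlier iteration. */
void recip_conditional(double *A, int n)
{
    double b = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (A[i] > 0) { b = A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

/* Unconditional form: b is written at the top of every iteration,
 * so no value of b flows from one iteration to the next. */
void recip_unconditional(double *A, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double b = A[i];
        if (A[i] > 0) { A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}
```

The two forms agree because the second test can only succeed when the first one did: if A[i] is not positive it is left unchanged and therefore cannot exceed 1, so b is never read without having been assigned in the same iteration.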
This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Intel® C++ Composer XE 2011 Getting Started Tutorials

Document Number: 323649-001US
World Wide Web: http://developer.intel.com
Legal Information

Contents

Legal Information
Introducing the Intel® C++ Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
    Getting Started with the Intel® C++ Composer XE 2011
    Starting the Intel® Debugger
Chapter 2: Tutorial: Intel® C++ Compiler
    Using Auto Vectorization
        Introduction to Auto-vectorization
        Establishing a Performance Baseline
        Generating a Vectorization Report
        Improving Performance by Pointer Disambiguation
        Improving Performance by Aligning Data
        Improving Performance with Interprocedural Optimization
        Additional Exercises
    Using Guided Auto-parallelization
        Introduction to Guided Auto-parallelization
        Preparing the Project for Guided Auto-parallelization
        Running Guided Auto-parallelization
        Analyzing Guided Auto-parallelization Reports
        Implementing Guided Auto-parallelization Recommendations

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.

Copyright © 2011, Intel Corporation. All rights reserved.

Introducing the Intel® C++ Composer XE 2011

This guide shows you how to start the Intel® C++ Composer XE 2011 and begin debugging code using the Intel® Debugger. The Intel® C++ Composer XE 2011 is a comprehensive set of software development tools that includes the following components:

• Intel® C++ Compiler
• Intel® Integrated Performance Primitives
• Intel® Threading Building Blocks
• Intel® Math Kernel Library
• Intel® Debugger

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of the Mac OS* X, including how to:

• install the Intel® C++ Composer XE 2011 on a supported Mac OS* X version. See the Release Notes.
• open a Mac OS* X command-line shell and execute fundamental commands including make.
• compile and link C/C++ source files.
Chapter 1: Navigation Quick Start

Getting Started with the Intel® C++ Composer XE 2011

The Intel® C++ Compiler XE 12.1 for Mac OS* X compiles C and C++ source files on Mac OS* X operating systems. The compiler is supported on IA-32 and Intel® 64 architectures.

You can use the Intel C++ Compiler XE 12.1 in the Xcode* integrated development environment or from the command line. This tutorial assumes you are using Xcode*, but supplies general instructions for starting the compiler from a command line.

Using the Compiler in Xcode*

You must first create or choose an existing C or C++ Xcode* project. These instructions assume you are creating a new project.

1. Launch Xcode.
2. Choose New Project from the File menu. When the New Project Assistant window appears, select a project template under Application; for example, select Command Line Tool. Click Choose.
3. Click Next, then name your project (hello_world, for example) and specify a save location. Click Save.
4. From within the project, highlight the target you want to change in the Groups & Files list under the Target group.
5. Double-click the target you want to change in the Groups & Files list under the Target group.
6. In the Target Info window, click Rules.
7. To add a new rule, click the + button at the bottom left-hand corner of the Target Info window. From the new Rule section:
   • under Process, choose C++ source files
   • under Using, choose Intel® C++ Compiler XE 12.1
8. Choose Build from the Build menu or click the Build and Go button in the toolbar.

To view the results of your build, choose Build Results from the Build menu in the Xcode toolbar.

See the Building Applications with Xcode* section in the compiler documentation for more information about using the compiler with the Xcode integrated development environment.

Using the Compiler from the Command Line

Start the compiler from a command line by performing the following steps:

1. Open a terminal session.
2. Set the environment variables for the compiler.
3. Invoke the compiler.

One way to set the environment variables prior to invoking the compiler is to "source" the compiler environment script, compilervars.sh (or compilervars.csh):

    source /bin/compilervars.sh

Here the script is located in the /bin directory of the directory structure that contains the compiler, and it takes the architecture argument listed below. Valid arguments are as follows:

• ia32: Compilers and libraries for IA-32 architectures only
• intel64: Compilers and libraries for Intel® 64 architectures only

To compile C source files, use a command similar to the following:

    icc my_source_file.c

To compile C++ source files, use a command similar to the following:

    icpc my_source_file.cpp

Following successful compilation, an executable is created in the current directory.

Starting the Intel® Debugger

The Intel® Debugger (IDB) is a full-featured symbolic source code application debugger that helps programmers:

• Debug programs
• Disassemble and examine machine code and examine machine register values
• Debug programs with shared libraries
• Debug multithreaded applications

The debugger features include:

• C/C++ language support
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Starting the Intel® Debugger

On Mac OS* X, you can use the Intel Debugger only from the command line. To start the command-line invocation of the Intel Debugger, execute the idb command.

Chapter 2: Tutorial: Intel® C++ Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® C++ Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently.
It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel C++ Compiler at optimization levels of -O2 and higher. Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:

• establish a performance baseline
• generate a vectorization report
• improve performance by pointer disambiguation
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, locate the source files in the product's Samples directory:

    /Samples//C++/vec_samples/

Use these files for this tutorial:

• Driver.c
• Multiply.c
• Multiply.h

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, compile your sources with these compiler options:

    icc -O1 -std=c99 -DNOFUNCCALL Multiply.c Driver.c -o MatVector

Execute MatVector and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

This example uses a variable length array (VLA) and therefore must be compiled with the -std=c99 option.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not.
Because vectorization is off at -O1, the compiler does not generate a vectorization report, so recompile at -O2 (default optimization):

    icc -std=c99 -DNOFUNCCALL -vec-report1 Multiply.c Driver.c -o MatVector

Record the new execution time. The reduction in time is mostly due to auto-vectorization of the inner loop at line 150 noted in the vectorization report:

    Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.

The -vec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them.

    icc -std=c99 -DNOFUNCCALL -vec-report2 Multiply.c Driver.c -o MatVector

The vectorization report indicates that the loop at line 45 in Multiply.c did not vectorize because it is not the innermost loop of the loop nest. Two versions of the innermost loop at line 55 were generated, but neither version was vectorized.

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: loop was not vectorized: existence of vector dependence.
    Multiply.c(55) (col. 3): remark: loop skipped: multiversioned.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(148) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

NOTE. For more information on the -vec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Pointer Disambiguation

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.

NOTE. The restrict qualifier requires the use of either the -restrict compiler option for .c or .cpp files, or the -std=c99 compiler option for .c files.

Replace the NOFUNCCALL macro with NOALIAS.

    icc -std=c99 -vec-report2 -DNOALIAS Multiply.c Driver.c -o MatVector

This conditional compilation replaces the loop in the main program with a function call. Execute MatVector and record the execution time reported in the output.

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now that the compiler has been told that the arrays do not overlap, it knows that it is safe to vectorize the loop.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary so that the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment. Using the ALIGNED macro will modify the declarations of a, b, and x in Driver.c using the __attribute keyword, which has the following syntax:

    float array[30] __attribute((aligned(base, [offset])));

This instructs the compiler to create an array that is aligned on a "base"-byte boundary with an "offset" (Default=0) in bytes from that boundary. Example:

    FTYPE a[ROW][COLWIDTH] __attribute((aligned(16)));

In addition, the row length of the matrix, a, needs to be padded out to be a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in Multiply.c are aligned by using #pragma vector aligned.

NOTE. If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.

Recompile the program after adding the ALIGNED macro to ensure consistently aligned data.

    icc -std=c99 -vec-report2 -DNOALIAS -DALIGNED Multiply.c Driver.c -o MatVector

    Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
    Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(140) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
    Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
    Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(72) (col. 3): remark: LOOP WAS VECTORIZED.
    Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
    Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
    Driver.c(60) (col. 3): remark: loop was not vectorized: not inner loop.
    Driver.c(61) (col. 4): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.
Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across source file boundaries. These may include, but are not limited to, function inlining. This is enabled with the -ipo option. Recompile the program using the -ipo option to enable interprocedural optimization.

icc -std=c99 -vec-report2 -DNOALIAS -DALIGNED -ipo Multiply.c Driver.c -o MatVector

Note that the vectorization messages now appear at the point of inlining in Driver.c (line 155).

Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(155) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(155) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(60) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by adding the macro FTYPE=float. The non-vectorized versions of the loop execute only slightly faster than the double precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set COLBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, #pragma vector aligned will cause the program to fail.
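The arithmetic behind the two statements above (four floats versus two doubles per 16-byte register, and the column padding needed for row alignment) can be sketched as below. The helper names simd_lanes and pad_columns are hypothetical, and the column count of 101 used in the usage note is an assumption chosen for illustration; the tutorial text itself does not state the matrix dimensions.

```c
#include <stddef.h>

/* Number of elements of elem_size bytes that fit in one vector register. */
size_t simd_lanes(size_t register_bytes, size_t elem_size)
{
    return register_bytes / elem_size;
}

/* Smallest number of extra columns such that (cols + pad) * elem_size
   is a multiple of boundary bytes. boundary must be greater than zero. */
size_t pad_columns(size_t cols, size_t elem_size, size_t boundary)
{
    size_t pad = 0;
    while (((cols + pad) * elem_size) % boundary != 0) {
        pad++;
    }
    return pad;
}
```

Under the assumed 101 columns, pad_columns(101, sizeof(float), 16) gives 3 on a platform with 4-byte floats, since 104 * 4 = 416 bytes is a multiple of 16, which would be consistent with the COLBUF=3 setting in the note. With 8-byte doubles, a single padding column would be enough, since 102 * 8 = 816 bytes.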
This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.

Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® C++ Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the -guide option with your normal compiler options at -O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization. Using -guide in conjunction with -parallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the source file archive located at:

/Samples//C++/guided_auto_parallel.tar.gz

The following files are included:
• Makefile
• main.cpp
• main.h
• scalar_dep.cpp
• scalar_dep.h

Copy these files to a directory on your system where you have write and execute permissions.

Running Guided Auto-parallelization

You can use the -guide option to generate GAP advice. From a directory where you can compile the sample program, execute make gap_vec_report from the command line, or execute:

icpc -c -guide scalar_dep.cpp

The GAP Report appears in the compiler output. GAP reports are encapsulated with GAP REPORT LOG OPENED and END OF GAP REPORT LOG.

GAP REPORT LOG OPENED
remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
scalar_dep.cpp(51): remark #30515: (VECT) Loop at line 51 cannot be vectorized due to conditional assignment(s) into the following variable(s): b. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
Number of advice-messages emitted for this compilation session: 1.
END OF GAP REPORT LOG

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code. For this sample tutorial, GAP generates output for the loop in scalar_dep.cpp:

for (i=0; i 0) {b=A[i]; A[i] = 1 / A[i]; }
if (A[i] > 1) {A[i] += b;}
}

In this example, the GAP Report generates a recommendation (remark #30761) to add the -parallel option to improve auto-parallelization. Remark #30515 indicates that if variable b can be unconditionally assigned, the compiler will be able to vectorize the loop.

GAP REPORT LOG OPENED
remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
scalar_dep.cpp(51): remark #30515: (VECT) Loop at line 51 cannot be vectorized due to conditional assignment(s) into the following variable(s): b. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
Number of advice-messages emitted for this compilation session: 1.
END OF GAP REPORT LOG

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the -parallel option to enable parallelization.
From the command line, execute make gap_par_report, or run the following:

icpc -c -guide -parallel scalar_dep.cpp

The compiler emits the following:

GAP REPORT LOG OPENED ON Wed Jul 28 14:33:09 2010
scalar_dep.cpp(51): remark #30523: (PAR) Loop at line 51 cannot be parallelized due to conditional assignment(s) into the following variable(s): b. This loop will be parallelized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration. [ALTERNATIVE] Another way is to use "#pragma parallel private(b)" to parallelize the loop. [VERIFY] The same conditions described previously must hold.
scalar_dep.cpp(51): remark #30525: (PAR) If the trip count of the loop at line 51 is greater than 188, then use "#pragma loop count min(188)" to parallelize this loop. [VERIFY] Make sure that the loop has a minimum of 188 iterations.
Number of advice-messages emitted for this compilation session: 2.
END OF GAP REPORT LOG

In the GAP Report, remark #30523 indicates that the loop at line 51 cannot be parallelized because the variable b is conditionally assigned. Remark #30525 indicates that the loop trip count must be greater than 188 for the compiler to parallelize the loop.

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program. For this loop, the conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

#ifdef TEST_GAP
#pragma loop count min (188)
for (i=0; i 0) {A[i] = 1 / A[i];}
if (A[i] > 1) {A[i] += b;}
}
#else
for (i=0; i 0) {b=A[i]; A[i] = 1 / A[i]; }
if (A[i] > 1) {A[i] += b;}
}
#endif
}

To verify that the loop is parallelized and vectorized:
• Add the compiler options -vec-report1 -par-report1.
• Add the conditional definition TEST_GAP to compile the appropriate code path.
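The [VERIFY] condition in the GAP advice can be checked by hand before rebuilding. The sketch below is an assumption-laden reconstruction, not the contents of scalar_dep.cpp: the loop headers in the listing above are incomplete, so the loop bound n, the unconditional statement b = A[i], and the function names scalar_dep_original, scalar_dep_gap, and same_result are all hypothetical. It is written in C and compares a conditionally assigned b with an unconditionally assigned b on the same input.

```c
#include <stddef.h>
#include <string.h>

#define MAX_N 64  /* hypothetical cap for the comparison buffers */

/* b is assigned only when A[i] > 0, as in the non-TEST_GAP path. */
void scalar_dep_original(double *A, int n)
{
    double b = 0.0;  /* initialized only to keep this sketch warning-free */
    for (int i = 0; i < n; i++) {
        if (A[i] > 0) { b = A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

/* b is written at the top of every iteration, as the GAP advice asks. */
void scalar_dep_gap(double *A, int n)
{
    for (int i = 0; i < n; i++) {
        double b = A[i];
        if (A[i] > 0) { A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

/* Returns 1 if both versions produce bit-identical output for the input. */
int same_result(const double *in, int n)
{
    double x[MAX_N];
    double y[MAX_N];
    if (n < 0 || n > MAX_N) {
        return 0;
    }
    memcpy(x, in, (size_t)n * sizeof *in);
    memcpy(y, in, (size_t)n * sizeof *in);
    scalar_dep_original(x, n);
    scalar_dep_gap(y, n);
    return memcmp(x, y, (size_t)n * sizeof *in) == 0;
}
```

The two versions agree because b is only read when A[i] > 1 after the first branch, and that can only happen if A[i] was positive at the start of the same iteration, in which case both versions hold the same value in b.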
From the command line, execute make final, or run the following:

icpc -c -parallel -DTEST_GAP -vec-report1 -par-report1 scalar_dep.cpp

The compiler's -vec-report and -par-report options emit the following output, confirming that the program is vectorized and parallelized:

scalar_dep.cpp(43) (col. 3): remark: LOOP WAS AUTO-PARALLELIZED.
scalar_dep.cpp(43) (col. 3): remark: LOOP WAS VECTORIZED.
scalar_dep.cpp(43) (col. 3): remark: LOOP WAS VECTORIZED.

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.

This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Intel® C++ Composer XE 2011 Getting Started Tutorials

Document Number: 323647-001US
World Wide Web: http://developer.intel.com
Legal Information

Contents

Legal Information
Introducing the Intel® C++ Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
    Starting the Intel® C++ Compiler from the Microsoft Visual Studio* IDE
    Switching between the Installed Compilers
    Starting the Intel® C++ Compiler from the Command Line
    Starting the Intel® Parallel Debugger Extension
Chapter 2: Tutorial: Intel® C++ Compiler
    Using Auto Vectorization
        Introduction to Auto-vectorization
        Establishing a Performance Baseline
        Generating a Vectorization Report
        Improving Performance by Pointer Disambiguation
        Improving Performance by Aligning Data
        Improving Performance with Interprocedural Optimization
        Additional Exercises
    Using Guided Auto-parallelization
        Introduction to Guided Auto-parallelization
        Preparing the Project for Guided Auto-parallelization
        Running Guided Auto-parallelization
        Analyzing Guided Auto-parallelization Reports
        Implementing Guided Auto-parallelization Recommendations
    Threading Your Applications
        Learning Objectives
        Threading Your Application

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P. Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation. Copyright © 2011, Intel Corporation. All rights reserved. 
Introducing the Intel® C++ Composer XE 2011

This guide shows you how to start the Intel® C++ Composer XE 2011 and begin debugging code using the Intel® Parallel Debugger Extension. The Intel® C++ Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® C++ Compiler
• Intel® Integrated Performance Primitives
• Intel® Threading Building Blocks
• Intel® Math Kernel Library
• Intel® Parallel Debugger Extension

Check http://software.intel.com/en-us/articles/intel-software-product-tutorials/ for the following:
• ShowMe video for using Intel® C++ Composer XE with Microsoft Visual Studio*

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

NOTE. Although the instructions and screen captures in these tutorials refer to the Visual Studio* 2005 integrated development environment (IDE), you can use these tutorials with later versions of Visual Studio.

Required Tools

You need the following tools to use these tutorials:
• Microsoft Visual Studio 2005 or later.
• Intel® C++ Composer XE 2011.
• Sample code included with the Intel® C++ Composer XE 2011.

NOTE.
• Samples are non-deterministic. Your results may vary from the examples shown throughout these tutorials.
• Samples are designed only to illustrate features and do not represent best practices for creating multithreaded code.

Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of Microsoft Visual Studio, including how to:
• open a project/solution.
• access the Document Explorer (valid in Microsoft Visual Studio 2005/2008).
• display the Solution Explorer.
• compile and link a project.
• ensure a project compiled successfully.
Chapter 1: Navigation Quick Start

Starting the Intel® C++ Compiler from the Microsoft Visual Studio* IDE

The Intel® C++ Composer XE 2011 integrates into the following versions of the Microsoft Visual Studio* Integrated Development Environment (IDE):
• Microsoft Visual Studio 2010*
• Microsoft Visual Studio 2008*
• Microsoft Visual Studio 2005*

Using the Intel® C++ Composer XE 2011 from Microsoft Visual Studio* IDE

To use the Intel® C++ Compiler, do the following:
1. Launch Microsoft Visual Studio*.
2. Open or create a Visual Studio solution in the Solution Explorer pane.
3. From the Project menu, select Intel C++ Compiler XE > Use Intel C++.
4. Click OK in the Confirmation dialog box. This configures the solution to use the Intel® C++ Compiler. (Visual Studio 2008 or Visual Studio 2005: you can configure the solution to use the Intel® C++ Compiler by clicking the toolbar icon. Visual Studio 2010: you can use Project > Properties > General > Platform Toolset to select the Intel C++ Compiler. This method is equivalent to using the Use Intel C++ menu item, except that you can make the selection in individual build configurations.)
5. Select Rebuild Solution from the Visual Studio Build menu. The results of the compilation display in the Output window.

Setting Intel® C++ Compiler Options

1. Select Project > Properties. The Property Pages for your solution display.
2. Locate C/C++ in the list and expand the heading.
3. Step through the available properties to select your configuration.

Compatibility

The Intel® C++ Compiler processes C and C++ language source files. The Intel® C++ Compiler is fully source- and binary-compatible (native code only) with the Microsoft Visual Studio* C++ compiler. The Intel C++ Compiler only supports native C++ project types provided by the Visual Studio development environment.
The project types with .NET attributes, such as the ones below, cannot be converted to an Intel C++ project:
• Empty Project (.NET)
• Class Library (.NET)
• Console Application (.NET)
• Windows Control Library (.NET)
• Windows Forms Application (.NET)
• Windows Service (.NET)

Refer to the User and Reference Guides for the full list of unsupported features.

Switching between the Installed Compilers

Switching to the Intel® C++ Composer XE 2011

To switch to the Intel® C++ Compiler, do the following:
1. Launch Microsoft Visual Studio*.
2. Open the solution.
3. From the Project menu, select Intel C++ Compiler XE > Use Intel C++.
4. Click OK in the Confirmation dialog box. This configures the solution to use the Intel® C++ Compiler. (Visual Studio 2008 or Visual Studio 2005: you can configure the solution to use the Intel® C++ Compiler by clicking the toolbar icon. Visual Studio 2010: you can use Project > Properties > General > Platform Toolset to select the Intel C++ Compiler. This method is equivalent to using the Use Intel C++ menu item, except that you can make the selection in individual build configurations.)

Switching to the Microsoft Visual Studio* C++ Compiler

If you are using the Intel® C++ Compiler, you can switch to the Visual C++ Compiler at any time. Switch compilers by doing the following:
1. Launch Microsoft Visual Studio*.
2. Open the solution.
3. From the Project drop-down menu, select Intel C++ Compiler XE > Use Visual C++.

This action updates the solution file to use the Microsoft Visual Studio C++ compiler. All configurations of affected projects are automatically cleaned unless you select Do not clean project(s). If you choose not to clean projects, you will need to rebuild updated projects to ensure all source files are compiled with the new compiler.

Starting the Intel® C++ Compiler from the Command Line

Follow these steps to invoke the Intel® C++ Compiler from the command line:
1. Open a command prompt from the Start > All Programs menu: Intel Parallel Studio XE 2011 > Command Prompt (or Intel Parallel Studio 2011 > Command Prompt).
2. Invoke the compiler as follows:

icl [options... ] inputfile(s) [/link link_options]

Use the command icl /help to display all available compiler options.

Starting the Intel® Parallel Debugger Extension

The Intel® Parallel Debugger Extension for Microsoft Visual Studio* is a debugging add-on for the Intel® Compiler's parallel code development features. It facilitates developing parallelism into applications based on the Intel® OpenMP* runtime environment.

The Intel® Parallel Debugger Extension provides:
• A new Microsoft Visual Studio* toolbar
• An extension to the Microsoft Visual Studio* Debug menu
• A set of new views and dialogs that are invoked from the toolbar or the menu tree

The debugger features include:
• C/C++ language support
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Preparing Applications for Parallel Debugging

You must enable the parallel debug instrumentation with the compiler to enable parallel debugging, such as analyzing shared data or breaking at re-entrant function calls. To enable the parallel debug instrumentation:
1. Open your application project in Microsoft Visual Studio*.
2. Select Project > Properties... from the menu. The Projectname Property Pages dialog box opens.
3. Enable Parallel debug checking.
   1. Select Configuration Properties > C/C++ > Debug in the left pane.
   2. Under Enable Parallel Debug Checks, select Yes (/debug:parallel).
4. Click OK.
5. Rebuild your application.

Your application is now instrumented for parallel debugging using the features of the Intel® Parallel Debugger Extension.
Chapter 2: Tutorial: Intel® C++ Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® C++ Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. Vectorization is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows), or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel C++ Compiler at optimization levels of /O2 and higher. Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by pointer disambiguation
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, open the vec_samples.zip archive in the product's Samples directory:

\Samples\\C++\vec_samples.zip

Use these files for this tutorial:
• matrix_vector_multiplication_c.sln
• matrix_vector_multiplication_c.vcproj
• Driver.c
• Multiply.c
• Multiply.h

Open the Microsoft Visual Studio solution file, matrix_vector_multiplication_c.sln, and follow the steps below to prepare the project for the vectorization exercises in this tutorial:
1. Convert to an Intel project by right-clicking on the matrix_vector_multiplication_c project and selecting Intel C++ Composer XE > Use Intel C++. Click OK in the Confirmation dialog.
2. Change the Active solution configuration to Release using Build > Configuration Manager.
3. Clean the solution by selecting Build > Clean Solution.

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, build your project with these settings:
1. Select Project > Properties > C/C++ > Optimization > General > Optimization > Minimize Size (/O1).
2. Select Project > Properties > C/C++ > Optimization > Intel Specific > Interprocedural Optimization > No.
3. Add the preprocessor definition, NOFUNCCALL, by selecting Project > Properties > C/C++ > Preprocessor > Preprocessor Definitions, then adding NOFUNCCALL to the existing list of preprocessor definitions.
4. Select Project > Properties > C/C++ > Language > Intel Specific > Enable C99 Support > Yes. This example uses a variable length array (VLA), and therefore must be compiled with the /Qstd=c99 option.
5. Rebuild the project, then run the executable (Debug > Start Without Debugging) and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not. Add the /Qvec-report1 option to the command line by selecting Project > Properties > C/C++ > Command Line > Additional Options, then adding /Qvec-report1. Because vectorization is off at /O1, the compiler does not generate a vectorization report, so recompile at /O2 (default optimization). Record the new execution time.
The reduction in time is mostly due to auto-vectorization of the inner loop at line 150 noted in the vectorization report:

Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.

The /Qvec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them. For C/C++ > Command Line > Additional Options, change /Qvec-report1 to /Qvec-report2. Also, for Linker > Command Line > Additional Options, add /Qvec-report2.

Rebuild your project. The vectorization report indicates that the loop at line 45 in Multiply.c did not vectorize because it is not the innermost loop of the loop nest. Two versions of the innermost loop at line 55 were generated, but neither version was vectorized.

Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
Multiply.c(55) (col. 3): remark: loop was not vectorized: existence of vector dependence.
Multiply.c(55) (col. 3): remark: loop skipped: multiversioned.
Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(148) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(150) (col. 4): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

NOTE. For more information on the /Qvec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Pointer Disambiguation

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.

NOTE. The restrict qualifier requires the use of either the /Qrestrict compiler option for .c or .cpp files, or the /Qstd=c99 compiler option for .c files.

Replace the NOFUNCCALL preprocessor definition with NOALIAS. This conditional compilation replaces the loop in the main program with a function call. Rebuild your project, run the executable, and record the execution time reported in the output.

Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now that the compiler has been told that the arrays do not overlap, it knows that it is safe to vectorize the loop.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary, so that the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions, and can avoid runtime tests of alignment. Using the ALIGNED macro modifies the declarations of a, b, and x in Driver.c using the __attribute keyword, which has the following syntax:

float array[30] __attribute((aligned(base, [offset])));

This instructs the compiler to create an array that is aligned on a "base"-byte boundary with an "offset" (default 0) in bytes from that boundary. Example:

FTYPE a[ROW][COLWIDTH] __attribute((aligned(16)));

In addition, the row length of the matrix a needs to be padded out to a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, we also need to tell the vectorizer that it can safely assume that the arrays in Multiply.c are aligned, by using #pragma vector aligned.

NOTE. If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned.
Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c. If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.

Rebuild the program after adding the ALIGNED preprocessor definition to ensure consistently aligned data.

Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(140) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(72) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(60) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(61) (col. 4): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.

Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across source file boundaries.
These may include, but are not limited to, function inlining. This is enabled with the /Qipo option.

Rebuild the program using the /Qipo option to enable interprocedural optimization. Select Optimization > Interprocedural Optimization > Multi-file(/Qipo)

Note that the vectorization messages now appear at the point of inlining in Driver.c (line 155).

Driver.c(145) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(155) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(155) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(60) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by adding the preprocessor definition FTYPE=float. The non-vectorized versions of the loop execute only slightly faster than the double precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set COLBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, #pragma vector aligned will cause the program to fail.

This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.
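
The padding arithmetic behind the COLBUF note can be sketched in C. This is a minimal illustration, not the sample's actual code: the helper names padded_cols, simd_lanes, is_aligned, and all_rows_aligned are hypothetical, the 101-column matrix width is an assumed value chosen for illustration, and the `__attribute__((aligned(16)))` spelling assumes a GCC-compatible compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest column count >= cols such that one row of
   elements of elem_size bytes occupies a multiple of align bytes. */
static size_t padded_cols(size_t cols, size_t elem_size, size_t align)
{
    assert(elem_size > 0 && align > 0);
    size_t n = cols;
    while ((n * elem_size) % align != 0) {
        ++n;
    }
    return n;
}

/* Number of elements that fit in one vector register of vec_bytes bytes. */
static size_t simd_lanes(size_t vec_bytes, size_t elem_size)
{
    return vec_bytes / elem_size;
}

/* Nonzero when p sits on an align-byte boundary. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Assumed sizes for illustration only: 101 columns padded to 102 doubles. */
#define ROW 101
#define COLWIDTH 102
static double matrix_a[ROW][COLWIDTH] __attribute__((aligned(16)));

/* Returns 1 when every row of matrix_a starts on a 16-byte boundary. */
static int all_rows_aligned(void)
{
    for (size_t i = 0; i < ROW; ++i) {
        if (!is_aligned(&matrix_a[i][0], 16)) {
            return 0;
        }
    }
    return 1;
}
```

Under these assumptions, a 101-column row needs one extra double (102 x 8 = 816 bytes) but three extra floats (104 x 4 = 416 bytes) to reach a multiple of 16 bytes, which is consistent with the COLBUF=3 setting in the note. The same helper shows why a 16-byte register holds four floats but only two doubles.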
Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® C++ Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the /Qguide option with your normal compiler options at /O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization. Using /Qguide in conjunction with /Qparallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the GuidedAutoParallel.zip archive in the product's Samples directory at: \Samples\\C++\

The following Visual Studio* 2005 project files and source files are included:
• GAP-c.sln
• GAP-c.vcproj
• main.cpp
• main.h
• scalar_dep.cpp
• scalar_dep.h

Open the Microsoft Visual Studio Solution file, GAP-c.sln, and follow the steps below to prepare the project for Guided Auto-parallelization (GAP).
1. Convert to an Intel project by right-clicking on the GAP-c project and selecting Intel C++ Composer XE > Use Intel C++. Click OK in the Confirmation dialog.
2. Clean the Solution by selecting Build > Clean Solution.
3. Since GAP is enabled only with option /O2 or higher, you will need to change the build configuration to Release using Build > Configuration Manager.

Running Guided Auto-parallelization

There are several ways to run GAP analysis in Visual Studio, depending on whether you want analysis for the whole solution, the project, a single file, a function, or a range of lines in your source code.
In this tutorial, we will use single-file analysis. Follow the steps below to run a single-file analysis on scalar_dep.cpp in the GAP-c project:
1. In the GAP-c project, right-click on scalar_dep.cpp.
2. Select Intel C++ Composer XE > Guided Auto Parallelism > Run Analysis on file "scalar_dep.cpp"
3. If the /Qipo option is enabled, the Analysis with Multi-file optimization dialog appears. Click Run Analysis.
4. On the Configure Analysis dialog, click Run Analysis using the choices shown here:

NOTE. If you select Send remarks to a file, GAP messages will not be available in the Output window or Error List window.

See the GAP Report in the Output window. GAP reports in the standard Output window are encapsulated with GAP REPORT LOG OPENED and END OF GAP REPORT LOG.

Also, see the GAP Messages in the Error List window:

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code. For this sample tutorial, GAP generates output for the loop in scalar_dep.cpp:

    for (i=0; i<n; i++) {
        if (A[i] > 0) {b=A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) {A[i] += b;}
    }

In this example, the GAP Report generates a recommendation (remark #30761) to add the /Qparallel option to improve auto-parallelization. Remark #30515 indicates that if variable b can be unconditionally assigned, the compiler will be able to vectorize the loop.

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the /Qparallel option to enable parallelization. Follow these steps to enable this option:
1. Right-click on the GAP-c project and select Properties
2. On the Property Pages dialog, expand the C/C++ heading and select Optimization.
3. In the right-hand pane under Intel Specific, select Parallelization, then choose Enable Parallelization (/Qparallel) and click OK.

Now, run the GAP Analysis again and review the GAP Report:

Remark #30521 indicates that the loop at line 50 cannot be parallelized because the variable b is conditionally assigned. Remark #30525 indicates that the loop trip count must be greater than 188 for the compiler to parallelize the loop.

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program. For this loop, the conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

    #ifdef TEST_GAP
    #pragma loop count min (188)
        for (i=0; i<n; i++) {
            b = A[i];
            if (A[i] > 0) {A[i] = 1 / A[i];}
            if (A[i] > 1) {A[i] += b;}
        }
    #else
        for (i=0; i<n; i++) {
            if (A[i] > 0) {b=A[i]; A[i] = 1 / A[i]; }
            if (A[i] > 1) {A[i] += b;}
        }
    #endif
    }

To verify that the loop is parallelized and vectorized:
1. Add the options /Qvec-report1 /Qpar-report1 to the Linker > Command Line > Additional Options dialog.
2. Add the preprocessor definition TEST_GAP to compile the appropriate code path.
3. Rebuild the GAP-c project and note the reports in the output window:

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.

This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Threading Your Applications

Learning Objectives

In this tutorial, we will be building different parallel implementations of the same function with both the Microsoft Visual C++* Compiler and Intel® C++ Composer XE 2011. When executed, the application will display the execution time required to render the object in the window title.
This time is an indication of the speedup obtained with parallel implementations compared to a baseline established with a serial implementation in the first step.

Threading Your Application

Tachyon is a ray-tracer application, rendering objects described in data files. The Tachyon program is located in the product Samples directory: \Samples\\C++\Tachyon.zip.

Expand the archive to \Tachyon

By default we use balls.dat as the input file. Data files are stored in the directory \Tachyon\dat\.

Originally, Tachyon was an application with parallelism implemented in function pthread_create() (source file \Tachyon\src\Windows\pthread.cpp) with explicit threads: one for the rendering, and the other for calculations. In this tutorial we implement parallelization on the calculation thread with OpenMP*, Intel® TBB, and Intel® Cilk™ Plus. Parallelization is implemented only for one function, draw_task(), which you can find in the source file build_serial.cpp, in project build_serial.

Open the Microsoft Visual Studio* Solution \Tachyon\vc8\tachyon_compiler.sln. It includes these projects:
• build_serial
• build_with_cilk
• build_with_openmp
• build_with_tbb
• tachyon.common

NOTE. Projects build_with_openmp, build_with_tbb and build_with_cilk use OpenMP, Intel® TBB and Intel® Cilk™ Plus, respectively. In addition to these implementations, there is also an option for users to implement with lambda functionality based on Intel TBB.

Follow the steps below to build the serial and Intel® Cilk™ Plus approaches to Tachyon.

Workflow Steps

In the following, we will be building different parallel implementations of the same function with both the Microsoft Visual C++ Compiler and the Intel® C++ Compiler. When executed, the application will display the execution time required to render the object in the window title.
This time is an indication of the speedup obtained with parallel implementations compared to a baseline established with a serial implementation in the first step.

Building the Serial Project
1. Set the build_serial project as the StartUp project (Project > Set as StartUp Project).
2. Set the configuration to Release mode: Build > Configuration Manager > Active solution configuration: > Release, then build the build_serial project.
3. Execute the application tachyon_compiler.exe with Debug > Start without Debugging. Take note of the time in seconds displayed in the window title. This time to render the image is the baseline for parallelization with the Microsoft Visual C++ Compiler.
4. For projects build_serial and "tachyon.common" change compiler to Intel(R) Parallel Composer (Project > Intel C++ Composer XE 2011 > Use Intel C++ ...).
5. Rebuild build_serial in Release mode (now with Intel Compiler).
6. Execute the application. Note the time to render the image as the baseline for parallelization with the Intel C++ Compiler.

Building with OpenMP*
1. Set the build_with_openmp project as StartUp project.
2. For project build_with_openmp, change the compiler to Intel C++ Composer XE (Project > Intel C++ Composer XE > Use Intel C++...).
3. For the project build_with_openmp, make sure the /Qopenmp compiler option is set (Project > Properties > Configuration Properties > C/C++ > Language > OpenMP Support = Generate Parallel Code (/Qopenmp)).
4. Open source file build_with_openmp.cpp in the project build_with_openmp.
5. Uncomment the OpenMP* pragmas in the routine draw_task which create parallel regions and distribute loop iterations within the team of threads.
6. Comment out the return inside the parallel region in the routine draw_task.
7. Uncomment the zero assignment to variable ison (ison = 0;) inside the parallel region in the routine draw_task.
8. Uncomment the return at the end of the routine draw_task.
9. Build build_with_openmp in Release configuration.
10. Execute the application.
11. Measure performance compared with the serial version.

Options that use OpenMP are available for both Intel® and non-Intel microprocessors, but these options may perform more optimizations on Intel® microprocessors than they perform on non-Intel microprocessors. The list of major, user-visible OpenMP constructs and features that may perform differently on Intel® vs. non-Intel microprocessors includes: locks (internal and user visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.

Building with Intel® TBB
1. Set build_with_tbb project as StartUp project.
2. For project build_with_tbb, change the compiler to Intel C++ Composer XE (Project > Intel C++ Composer XE > Use Intel C++...).
3. For the project build_with_tbb make sure the Intel® TBB environment is set (Project > Intel C++ Composer XE > Select Build Components > Use TBB). See Note below.
4. Open source file build_with_tbb.cpp in the project build_with_tbb.
5. Uncomment TBB header files.
6. Uncomment class draw_task.
7. Comment out routine draw_task.
8. Uncomment lines regarding TBB schedule and number of threads in routine thread_trace.
9. Uncomment lines regarding grain size in routine thread_trace.
10. Uncomment TBB parallel_for routine in routine thread_trace.
11. Comment out call of routine draw_task in routine thread_trace.
12. Build build_with_tbb in Release configuration.
13. Execute the application.
14. Measure performance compared with the serial version.

NOTE. Double check that the following project properties are set:
• Configuration Properties > C/C++ > General > Additional Include Directories: contains $(INTEL_DEF_IA32_INSTALL_DIR)TBB\Include
• Configuration Properties > Linker > General > Additional Library Directories: contains
  "$(INTEL_DEF_IA32_INSTALL_DIR)TBB\Lib\ia32\vc8" for Visual Studio 2005;
  "$(INTEL_DEF_IA32_INSTALL_DIR)TBB\Lib\ia32\vc9" for Visual Studio 2008;
  "$(INTEL_DEF_IA32_INSTALL_DIR)TBB\Lib\ia32\vc10" for Visual Studio 2010;
• For platform x64, the $(INTEL_DEF_X64_INSTALL_DIR) is used instead of $(INTEL_DEF_IA32_INSTALL_DIR) and the library directory becomes $(INTEL_DEF_X64_INSTALL_DIR)TBB\Lib\intel64\vc8 for Visual Studio 2005.

Building with Intel® Cilk™ Plus
1. Set the build_with_cilk project as the StartUp project.
2. For project build_with_cilk change compiler to the Intel C++ Compiler (Project > Intel C++ Composer XE 2011 > Use Intel C++ ...).
3. For the project build_with_cilk make sure the Intel® Cilk™ Plus for Intel® C++ Compiler additional include directory is set (Project > Properties > Configuration Properties > C/C++ > General > Additional Include Directories = C:\Program Files\Intel\ComposerXE-2011\compiler\include\cilk\).
4. Open source file build_with_cilk.cpp in the project build_with_cilk.
5. Uncomment Intel® Cilk™ Plus header files.
6. Uncomment routine draw_task related to Intel® Cilk™ Plus implementation.
7. Comment out the serial draw_task() function.
8. Build build_with_cilk in Release mode.
9. Execute the application.
10. Measure performance compared with the serial version for Intel(R) Parallel Composer.

Platform and Other Details

The solution for this example was created in Microsoft Visual Studio 2005. If you open the tachyon_compiler.sln solution in Microsoft Visual Studio 2008, then it will be converted to a Microsoft Visual Studio 2008 solution.
For Platform Win32
• The executable file for all implementations is tachyon_compiler.exe in the \Tachyon\vc8\Release\ directory.
• Object files are stored in the \Tachyon\vc8\tachyon_compiler\Release\ directory.

For Platform x64
• The executable file for all implementations is tachyon_compiler.exe in the \Tachyon\vc8\x64\Release\ directory.
• Object files are stored in the \Tachyon\vc8\x64\tachyon_compiler\Release\ directory.

Intel® Fortran Composer XE 2011 Getting Started Tutorials

Document Number: 323651-001US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel® Fortran Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
    Getting Started with the Intel® Fortran Composer XE 2011
    Starting the Intel® Debugger
Chapter 2: Tutorial: Intel® Fortran Compiler
    Using Auto Vectorization
        Introduction to Auto-vectorization
        Establishing a Performance Baseline
        Generating a Vectorization Report
        Improving Performance by Aligning Data
        Improving Performance with Interprocedural Optimization
        Additional Exercises
    Using Guided Auto-parallelization
        Introduction to Guided Auto-parallelization
        Preparing the Project for Guided Auto-parallelization
        Running Guided Auto-parallelization
        Analyzing Guided Auto-parallelization Reports
        Implementing Guided Auto-parallelization Recommendations
    Using Coarray Fortran
        Introduction to Coarray Fortran
        Compiling the Sample Program
        Controlling the Number of Images

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined."
Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.
Copyright © 2011, Intel Corporation. All rights reserved.

Introducing the Intel® Fortran Composer XE 2011

This guide shows you how to start the Intel® Fortran Composer XE 2011 and begin debugging code using the Intel® Debugger. The Intel® Fortran Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® Fortran Compiler
• Intel® Math Kernel Library
• Intel® Debugger

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.
Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of the Linux* operating system, including how to:
• install the Intel® Fortran Composer XE 2011 on a supported Linux distribution. See the Release Notes.
• open a Linux shell and execute fundamental commands including make.
• compile and link Fortran source files.

Chapter 1: Navigation Quick Start

Getting Started with the Intel® Fortran Composer XE 2011

The Intel® Fortran Compiler XE 12.1 compiles Fortran source files on Linux* operating systems. The compiler is supported on IA-32 and Intel® 64 architectures. The compiler operates only from a command line on Linux* operating systems.
1. Open a terminal session.
2. Set the environment variables for the compiler.
3. Invoke the compiler.

One way to set the environment variables prior to invoking the compiler is to "source" the compiler environment script, compilervars.sh (or compilervars.csh): source /bin/compilervars.sh where is the directory structure containing the compiler /bin directory, and is the architecture argument listed below.

The environment script takes an argument based on architecture. Valid arguments are as follows:
• ia32: Compiler and libraries for IA-32 architectures only
• intel64: Compiler and libraries for Intel® 64 architectures only

To compile Fortran source files, use a command similar to the following:

    ifort my_source_file.f90

Following successful compilation, an executable is created in the current directory.
Starting the Intel® Debugger

The Intel® Debugger (IDB) is a full-featured symbolic source code application debugger that helps programmers:
• Debug programs
• Disassemble and examine machine code and examine machine register values
• Debug programs with shared libraries
• Debug multithreaded applications

The debugger features include:
• Fortran language support including Fortran 95/90
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Starting the Intel® Debugger

On Linux*, you can use the Intel Debugger from a Java* GUI application or the command-line.
• To start the GUI for the Intel Debugger, execute the idb command from a Linux shell.
• To start the command-line invocation of the Intel Debugger, execute the idbc command from a Linux shell.

Chapter 2: Tutorial: Intel® Fortran Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® Fortran Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in greater performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel Fortran Compiler at optimization levels of -O2 and higher.
Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, locate the source files in the product's Samples directory: /Samples//Fortran/vec_samples/

Use these files for this tutorial:
• driver.f90
• matvec.f90

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, compile your sources with these compiler options:

    ifort -real-size 64 -O1 -vec-report1 matvec.f90 driver.f90 -o MatVector

Execute MatVector and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not. Because vectorization is off at -O1, the compiler does not generate a vectorization report, so recompile at -O2 (default optimization):

    ifort -real-size 64 -vec-report1 matvec.f90 driver.f90 -o MatVector

Record the new execution time. The reduction in time is mostly due to auto-vectorization of the inner loop at line 32 noted in the vectorization report:

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.

The -vec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them.
    ifort -real-size 64 -vec-report2 matvec.f90 driver.f90 -o MatVector

The vectorization report indicates that the loop at line 33 in matvec.f90 did not vectorize because it is not the innermost loop of the loop nest.

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

NOTE. For more information on the -vec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve the vectorizer performance by aligning the arrays a, b, and c in driver.f90 on a 16-byte boundary so the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment.
Using the ALIGNED macro inserts an alignment directive for a, b, and c in driver.f90 with the following syntax:

    !dir$ attributes align : 16 :: a,b,c

This instructs the compiler to create arrays that are aligned on a 16-byte boundary, which should facilitate the use of SSE aligned load instructions. In addition, the column height of the matrix a needs to be padded out to be a multiple of 16 bytes, so that each individual column of a maintains the same 16-byte alignment. In practice, maintaining a constant alignment between columns is much more important than aligning the start of the arrays.

To derive the maximum benefit from this alignment, we also need to tell the vectorizer that it can safely assume the arrays in matvec.f90 are aligned, by using the directive:

    !dir$ vector aligned

NOTE. If you use !dir$ vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if !dir$ vector aligned is not used. See the code under the ALIGNED macro in matvec.f90.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, !dir$ vector aligned advises the compiler that the data is 32-byte aligned.

Recompile the program after adding the ALIGNED macro to ensure consistently aligned data:

    ifort -real-size 64 -vec-report2 -DALIGNED matvec.f90 driver.f90 -o MatVector

    matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
    matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
    matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
    driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
    driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
    driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
    driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
    driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
    driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
    driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
    driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
    driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
    driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
    driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
    driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across source line boundaries. These may include, but are not limited to, function inlining. This is enabled with the -ipo option.

Recompile the program using the -ipo option to enable interprocedural optimization:

    ifort -real-size 64 -vec-report2 -DALIGNED -ipo matvec.f90 driver.f90 -o MatVector

Note that the vectorization messages now appear at the point of inlining in driver.f90 (line 70).

    driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
    driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
    driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
    driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
    driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
    driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
    driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
    driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
    driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
    driver.f90(73) (col. 16): remark: loop was not vectorized: not inner loop.
    driver.f90(70) (col. 14): remark: LOOP WAS VECTORIZED.
    driver.f90(70) (col. 14): remark: loop was not vectorized: not inner loop.
    driver.f90(70) (col. 14): remark: LOOP WAS VECTORIZED.
    driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by changing the command-line option -real-size 64 to -real-size 32.

The non-vectorized versions of the loop execute only slightly faster than the double precision versions; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set ROWBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, the directive !dir$ vector aligned will cause the program to fail.

This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.

Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® Fortran Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the -guide option with your normal compiler options at -O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization. Using -guide in conjunction with -parallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3. analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the source file archive located at:

    /Samples//Fortran/guided_auto_parallel.tar.gz

The following files are included:
• Makefile
• main.f90
• scalar_dep.f90

Copy these files to a directory on your system where you have write and execute permissions.

Running Guided Auto-parallelization

You can use the -guide option to generate GAP advice. From a directory where you can compile the sample program, execute make vec from the command-line, or execute:

    ifort -c -guide scalar_dep.f90

The GAP Report appears in the compiler output. GAP reports are delimited by GAP REPORT LOG OPENED and END OF GAP REPORT LOG.

    GAP REPORT LOG OPENED ON Mon Aug 2 14:04:34 2010
    remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
    scalar_dep.f90(44): remark #30515: (VECT) Loop at line 44 cannot be vectorized due to conditional assignment(s) into the following variable(s): t. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
    Number of advice-messages emitted for this compilation session: 1.
    END OF GAP REPORT LOG

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code. For this sample tutorial, GAP generates output for the loop in scalar_dep.f90:

    do i = 1, n
      if (a(i) >= 0) then
        t = i
      end if
      if (a(i) > 0) then
        a(i) = t * (1 / (a(i) * a(i)))
      end if
    end do

In this example, the GAP Report generates a recommendation (remark #30761) to add the -parallel option to improve auto-parallelization.
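The [VERIFY] condition attached to remark #30515 can be checked by hand for this loop: t is read only when a(i) > 0, and in that case a(i) >= 0 also holds, so t has already been written in the same iteration. The following is a minimal sketch of that check, not part of the sample archive. The program name, the array size, the test data, and the real type chosen for a and t are all assumptions made for illustration. It runs the conditional and the unconditional form of the loop on the same data and prints the largest difference between the two results.

```fortran
program check_gap_rewrite
  implicit none
  integer, parameter :: n = 8          ! hypothetical array size
  real :: a_cond(n), a_uncond(n), t    ! types are assumed, not taken from the sample
  integer :: i

  ! Hypothetical test data mixing negative, zero, and positive values.
  a_cond   = (/ -2.0, 0.0, 1.0, 4.0, -0.5, 2.0, 0.0, 8.0 /)
  a_uncond = a_cond

  ! Conditional form: t is assigned only when a(i) >= 0.
  t = 0.0
  do i = 1, n
    if (a_cond(i) >= 0) then
      t = i
    end if
    if (a_cond(i) > 0) then
      a_cond(i) = t * (1 / (a_cond(i) * a_cond(i)))
    end if
  end do

  ! Unconditional form: t is assigned at the top of every iteration.
  do i = 1, n
    t = i
    if (a_uncond(i) > 0) then
      a_uncond(i) = t * (1 / (a_uncond(i) * a_uncond(i)))
    end if
  end do

  ! Compare the two result arrays element by element.
  write(*,*) "Largest difference between the two results: ", &
             maxval(abs(a_cond - a_uncond))
end program check_gap_rewrite
```

The comparison covers only the array a. The value left in t after the loop can differ between the two forms when the last element is negative, so the rewrite is safe only if t is not used after the loop.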
Remark #30515 indicates that if variable t can be unconditionally assigned, the compiler will be able to vectorize the loop.

    GAP REPORT LOG OPENED
    remark #30761: Add -parallel option if you want the compiler to generate recommendations for improving auto-parallelization.
    scalar_dep.f90(44): remark #30515: (VECT) Loop at line 44 cannot be vectorized due to conditional assignment(s) into the following variable(s): t. This loop will be vectorized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration.
    Number of advice-messages emitted for this compilation session: 1.
    END OF GAP REPORT LOG

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the -parallel option to enable parallelization. From the command-line, execute make gap_par_report, or run the following:

    ifort -c -parallel -guide scalar_dep.f90

The compiler emits the following:

    GAP REPORT LOG OPENED ON Mon Aug 2 14:04:44 2010
    scalar_dep.f90(44): remark #30523: (PAR) Loop at line 44 cannot be parallelized due to conditional assignment(s) into the following variable(s): t. This loop will be parallelized if the variable(s) become unconditionally initialized at the top of every iteration. [VERIFY] Make sure that the value(s) of the variable(s) read in any iteration of the loop must have been written earlier in the same iteration. [ALTERNATIVE] Another way is to use "!dir$ parallel private(t)" to parallelize the loop. [VERIFY] The same conditions described previously must hold.
    scalar_dep.f90(44): remark #30525: (PAR) If the trip count of the loop at line 44 is greater than 36, then use "!dir$ loop count min(36)" to parallelize this loop. [VERIFY] Make sure that the loop has a minimum of 36 iterations.
    Number of advice-messages emitted for this compilation session: 2.
    END OF GAP REPORT LOG

In the GAP Report, remark #30523 indicates that the loop at line 44 cannot be parallelized because the variable t is conditionally assigned. Remark #30525 indicates that the loop trip count must be greater than 36 for the compiler to parallelize the loop.

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program. For this loop, conditional compilation selects a version in which t is assigned unconditionally, which enables parallelization and vectorization of the loop as recommended by GAP:

    do i = 1, n
    !dir$ if defined(test_gap)
      t = i
    !dir$ else
      if (a(i) >= 0) then
        t = i
      end if
    !dir$ endif
      if (a(i) > 0) then
        a(i) = t * (1 / (a(i) * a(i)))
      end if
    end do

To verify that the loop is parallelized and vectorized:
• Add the compiler options -vec-report1 -par-report1.
• Add the conditional definition test_gap to compile the appropriate code path.

From the command-line, execute make w_changes, or run the following:

    ifort -c -parallel -Dtest_gap -vec-report1 -par-report1 scalar_dep.f90

The compiler's -vec-report and -par-report options emit the following output, confirming that the program is vectorized and parallelized:

    scalar_dep.f90(44) (col. 9): remark: LOOP WAS AUTO-PARALLELIZED.
    scalar_dep.f90(44) (col. 9): remark: LOOP WAS VECTORIZED.
    scalar_dep.f90(44) (col. 9): remark: LOOP WAS VECTORIZED.

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.

This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Using Coarray Fortran

Introduction to Coarray Fortran

The Intel® Fortran Compiler XE supports parallel programming using coarrays as defined in the Fortran 2008 standard.
As an extension to the Fortran language, coarrays offer one method to use Fortran as a robust and efficient parallel programming language. Coarray Fortran uses a single-program, multiple-data (SPMD) programming model. Coarrays are supported in the Intel® Fortran Compiler XE for Linux* and Intel® Visual Fortran Compiler XE for Windows*.

This tutorial demonstrates how to compile a simple coarray Fortran application using the Intel Fortran Compiler XE, and how to control the number of images (processes) for the application.

Locating the Sample

To begin this tutorial, locate the source file in the product's Samples directory:

    /Samples//Fortran/coarray_samples/hello_image.f90

Copy hello_image.f90 to a working directory, then continue with this tutorial.

NOTE. The Intel Fortran Compiler implementation of coarrays follows a draft version of the Fortran 2008 Standard. Not all features present in the Fortran 2008 Standard may be implemented by Intel. Consult the Release Notes for a list of supported features.

Compiling the Sample Program

The hello_image.f90 sample is a hello world program. Unlike the usual hello world, this coarray Fortran program will spawn multiple images, or processes, that will run concurrently on the host computer. Examining the source code for this application shows a simple Fortran program:

    program hello_image
      write(*,*) "Hello from image ", this_image(), &
                 "out of ", num_images(), " total images"
    end program hello_image

Note the function calls to this_image() and num_images(). These are new Fortran 2008 intrinsic functions. The num_images() function returns the total number of images or processes spawned for this program. The this_image() function returns a unique identifier for each image in the range 1 to N, where N is the total number of images created for this program.
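To show how this_image() and num_images() are typically combined with an actual coarray, here is a minimal sketch that goes one step beyond the hello world sample. It is not one of the shipped samples. The program name image_sum and the variables partial, total, and img are hypothetical. Each image stores its own index in a coarray, and image 1 then reads every image's value and adds them up.

```fortran
program image_sum
  implicit none
  integer :: partial[*]      ! coarray: one copy of partial exists on every image
  integer :: total, img

  partial = this_image()     ! each image stores its own index (1 to N)
  sync all                   ! wait until every image has written its value

  if (this_image() == 1) then
    total = 0
    do img = 1, num_images()
      total = total + partial[img]   ! read the copy of partial held by image img
    end do
    write(*,*) "Sum over ", num_images(), " images: ", total
  end if
end program image_sum
```

With N images, the sum printed by image 1 is 1 + 2 + ... + N. The sync all statement matters here: without it, image 1 could read partial from an image that has not yet assigned it.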
To compile the sample program containing the Coarray Fortran features, use the -coarray compiler option:

    ifort -coarray hello_image.f90 -o hello_image

If you run the hello_image executable, the output will vary depending on the number of processor cores on your system:

    ./hello_image
    Hello from image 1 out of 8 total images
    Hello from image 6 out of 8 total images
    Hello from image 7 out of 8 total images
    Hello from image 2 out of 8 total images
    Hello from image 5 out of 8 total images
    Hello from image 8 out of 8 total images
    Hello from image 3 out of 8 total images
    Hello from image 4 out of 8 total images

By default, when a Coarray Fortran application is compiled with the Intel Fortran Compiler, the invocation creates as many images as there are processor cores on the host platform. The example shown above was run on a dual quad-core host system with eight total cores. As shown, each image is a separately spawned process on the system and executes asynchronously, which is why the lines appear in no particular order.

NOTE. The -coarray option cannot be used in conjunction with -openmp options. You cannot mix Coarray Fortran language extensions with OpenMP extensions.

Controlling the Number of Images

There are two methods to control the number of images created for a Coarray Fortran application. First, you can use the -coarray-num-images=N compiler option to compile the application, where N is the number of images. This option sets the number of images created for the application during run time. For example, use the -coarray-num-images=2 option to limit the number of images of the hello_image.f90 program to exactly two:

    ifort -coarray -coarray-num-images=2 hello_image.f90 -o hello_image
    ./hello_image
    Hello from image 2 out of 2 total images
    Hello from image 1 out of 2 total images

The second way to control the number of images is to use the environment variable FOR_COARRAY_NUM_IMAGES, setting this to the number of images you want to spawn.
As an example, recompile hello_image.f90 without the -coarray-num-images option. Then, before you run the executable hello_image, set the environment variable FOR_COARRAY_NUM_IMAGES to the number of images you want created during the program run.

For bash shell users, set the environment variable with this command:

    export FOR_COARRAY_NUM_IMAGES=4

For csh/tcsh shell users, set the environment variable with this command:

    setenv FOR_COARRAY_NUM_IMAGES 4

For example, assuming bash shell:

    ifort -coarray hello_image.f90 -o hello_image
    export FOR_COARRAY_NUM_IMAGES=4
    ./hello_image
    Hello from image 1 out of 4 total images
    Hello from image 3 out of 4 total images
    Hello from image 2 out of 4 total images
    Hello from image 4 out of 4 total images
    export FOR_COARRAY_NUM_IMAGES=3
    ./hello_image
    Hello from image 3 out of 3 total images
    Hello from image 2 out of 3 total images
    Hello from image 1 out of 3 total images

NOTE. Setting FOR_COARRAY_NUM_IMAGES=N overrides the -coarray-num-images compiler option.

Intel® Parallel Inspector 2011 Release Notes

Installation Guide and Release Notes
Document number: 320754-002US
7 August 2011

Contents
Introduction
What's New
System Requirements
Installation Notes
Issues and Limitations
Attributions
Disclaimer and Legal Information

1 Introduction

Intel® Parallel Inspector 2011 is a serial and multithreading error checking analysis tool for Microsoft Visual Studio* C/C++ developers. Inspector detects memory leaks and errors as well as threading data races and deadlock errors. This comprehensive developer productivity tool pinpoints errors and provides guidance to help ensure application reliability and quality.

This document provides system requirements, installation instructions, issues and limitations, and legal information.

To learn more about this product, see the Inspector Documentation at:
• Start > All Programs > Intel Parallel Studio 2011 > Parallel Studio Documentation > Inspector Documentation.
• Or \documentation\\documentation_inspector.htm. For example, if you install the product in the default installation path, you can find the documentation at:
  C:\Program Files\Intel\Parallel Studio 2011\Inspector\documentation\en\documentation_inspector.htm

For technical support, including answers to questions not addressed in the installed tool, visit the technical support forum at: http://software.intel.com/sites/support/

Please remember to register your tool at https://registrationcenter.intel.com/ by providing your email address. This helps Intel recognize you as a valued customer in the support forum.

2 What's New

Intel® Parallel Inspector 2011 Update 6:
• Update numbers are now aligned with Intel® Inspector XE 2011. As a result, the Update number skips from Update 2 in the previous release to Update 6 in this release.
• New memory growth reporting: use the new Set Transaction Start and Set Transaction End buttons during analysis to detect whether a block of memory is allocated but not deallocated within a specific time segment during application execution
• Analysis support for C# .NET applications
• New C# .NET sample code
• Added stability improvements

Intel® Parallel Inspector 2011 Update 2:
• Improved GUI:
  • Simpler, more intuitive real-time analysis views, main result data view, and import view
  • Enhanced state management and problem filtering
  • New memory overhead gauge to help choose the optimal preset analysis configuration
• Updates for Operating System and IDE support
  • Added Microsoft Windows 7* SP1
  • Added Microsoft Visual Studio* 2010 SP1
• Added stability improvements

Intel® Parallel Inspector 2011 Update 1:
• Improved analysis configuration (The Collection dialog now contains three levels of analysis. The level of analysis formerly known as mi4/ti4 is now available as an additional option when you select the mi3 or ti3 level of analysis, respectively)
• New Managing Suppressions tutorial
• Bug fixes

Intel® Parallel Inspector 2011:
• Microsoft Visual Studio* 2010 support
• Resource leak detection
• Intel® Cilk™ Plus support
• Activation tool

See http://software.intel.com/en-us/intel-parallel-inspector/ or the What's New section in the help.

3 System Requirements

For an explanation of architecture names, see http://software.intel.com/en-us/articles/intel-architecture-platform-terminology/

• A system with an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium® 4 processor or later, or compatible non-Intel processor)
  • Incompatible or proprietary instructions in non-Intel processors may cause the analysis capabilities of this tool to function incorrectly. Any attempt to analyze code not supported by Intel® processors may lead to failures in this tool.
  • For the best experience, a multi-core or multi-processor system is recommended.
• 2GB RAM
• 4GB free disk space for all tool features and architectures
• Software requirements
  • Operating system: Microsoft Windows 7* SP1, Microsoft Windows XP* SP3, Microsoft Windows Vista* SP2, Microsoft Windows Server* 2008 SP2, 32-bit or x64 editions (embedded editions not supported). NOTE: In a future major release of this product, support for installation and use on Microsoft Windows Vista* will be removed.
  • Microsoft Visual Studio* 2005 SP1, 2008 SP1 or 2010 SP1 software with C++ component installed [0] (Microsoft Visual Studio* Express Edition not supported). NOTE: In a future major release of this product, support for installation and use with Microsoft Visual Studio* 2005 will be removed. Intel recommends that customers migrate to Microsoft Visual Studio* 2010 at their earliest convenience.
  • Application coding requirements
    • Programming Language: C or C++ (native, not managed code)
    • Threading methodologies supported by the analysis tool:
      • Intel® Threading Building Blocks (Intel® TBB)
      • Win32* Threads on Windows*
      • OpenMP* [1]
      • Intel's C/C++ Parallel Language Extensions
      • Intel® Cilk™ Plus
  • To view PDF documents, use a PDF reader, such as Adobe Reader*.

Notes:
[0] Inspector supports analysis of applications built with the Intel® Parallel Composer, Intel® C++ Compiler Professional Edition version 10.0 or higher, and/or Microsoft Visual C++* 2005 SP1, 2008 SP1 or 2010 SP1 software.
[1] Applications that use OpenMP* technology and are built with the Microsoft* compiler must link to the OpenMP* compatibility library as supplied by an Intel® compiler.

4 Installation Notes

If you are installing the Inspector for the first time, please be sure to have the product serial number available so you can type it in during installation. Inspector updates uninstall your currently installed Inspector version, and use the existing valid Inspector license on the system.

Default Installation Folders

The default top-level installation folder for the Inspector is:

    C:\Program Files\Intel\Parallel Studio 2011\Inspector

If you are installing on a system with a non-English language version of the Windows* operating system, the name of the Program Files folder may be different. On Intel® 64 architecture systems, the folder name is Program Files (x86) or the equivalent.

Changing, Updating and Removing the Tool

To remove, modify, or repair the Inspector:
1. Open the Control Panel.
2. Select the Add or Remove Programs applet.
3. Select Intel Parallel Inspector 2011.
4. Click the Change button.

Converting Evaluation-licensed Products to Fully Licensed Products

To convert your evaluation software to a fully licensed product:
1. From the Start menu, click Start > All Programs > Intel Parallel Studio 2011 > Product Activation
2. Supply a valid product serial number
3. Click Activate

Inspector Documentation

Inspector documentation is automatically integrated into supported versions of Microsoft Visual Studio*. If documentation integration does not work or disappears, follow these steps to restore documentation integration:
1. Click Start > All Programs > Intel Parallel Studio 2011 > Command Prompt and choose any shortcut (such as IA-32 Visual Studio 2005 mode).
2. Remove integration:
  • "insp-vsreg -d 2005" to remove the Inspector integration with VS2005
  • "insp-vsreg -d 2008" to remove the Inspector integration with VS2008
  • "insp-vsreg -d 2010" to remove the Inspector integration with VS2010
3. Restore integration:
  • "insp-vsreg -i 2005" to restore the Inspector integration with VS2005
  • "insp-vsreg -i 2008" to restore the Inspector integration with VS2008
  • "insp-vsreg -i 2010" to restore the Inspector integration with VS2010

If you still cannot access integrated Inspector documentation from the Microsoft Visual Studio* Help menu, try accessing Inspector documentation from the Start menu (Start > Intel Parallel Studio 2011 > Parallel Studio Documentation > Inspector Documentation) or directly from the Inspector Documentation Index at \documentation\\documentation_inspector.htm.

Also, the Inspector Help may be unavailable in Microsoft Visual Studio* software if the language for non-Unicode programs does not match the operating system language: for example, the Japanese Windows* operating system with English language set for non-Unicode programs. Workaround: Configure the language for non-Unicode programs to match the operating system language (go to Control Panel > Regional and Language Options > tab: Advanced).

5 Issues and Limitations

Installation
• Inspector may not install correctly if an installation of other software is in progress.
• If you have both Microsoft Visual Studio* 2005 and 2008 integrated development environments (IDEs) installed on your system and integrate the Intel® Parallel Studio 2011 into both IDEs, removing the integration from one IDE can remove the integrated Intel® Parallel Studio documentation from both IDEs. To work around this problem, follow the instructions provided in the Installation Notes/Inspector Documentation subsection. Follow only the steps for VS2005 and VS2008.

General Issues
• Inspector does not guarantee this software tool will detect or report every memory and threading error in an application.
  • Not all logic errors are detectable.
  • Heuristics used to eliminate false positives may hide real issues.
  • Highly correlated events will be grouped into a single problem.
• You can use the Inspector to analyze applications in debug and release modes. To learn more about options necessary to produce the most accurate, complete results, please refer to the following two resources:
  • Memory error analysis: http://software.intel.com/en-us/articles/compiler-settings-for-memory-error-analysis-in-intel-parallel-inspector/
  • Threading error analysis: http://software.intel.com/en-us/articles/compiler-settings-for-threading-error-analysis-in-intel-parrallel-inspector/
• If no symbols are found for a module in which a problem is detected, the Inspector displays the call stack and observation source code of the first location where it can find symbols. If it cannot find any location in the call stack with symbols, it displays the module name and relative virtual address (RVA) for the location.
• Inspector analyzes only one process in an application: the initial process created by the execution of the targeted application. This means an application launched by a script results in analysis of the script, not the process the script starts.
• Applications that crash when run outside the Inspector may crash or hang the Inspector runtime analysis engine. For example, a corrupt return address on an application call stack crashes the runtime analysis engine. If a crash occurs, problems detected prior to that time can be viewed, but memory leaks are not reported.
• Inspector uses a socket to communicate between the graphical user interface and the runtime analysis engine. Preventing an application from opening a socket prevents the Inspector from analyzing the application.
• Inspector may report an incorrect call stack following an interruption of normal call flow, such as when an exception is thrown and caught. While the Inspector recognizes and attempts to correct result data when this situation occurs, it is possible for a threading or memory problem to be reported before the call stack is fully corrected.
• You cannot obtain meaningful results if the application under analysis launches a debugger.
• Synchronization, function calls and memory loads/stores that occur before the Inspector takes control of the program are not visible to the Inspector. Missing these events may cause the tool to report false positives. This situation can occur if these constructs occur in DllMain.
• When using the Help Viewer in Visual Studio 2010 SP1, if you click the Where am I in the Workflow? icon in the upper-right of some Inspector help topics, do the following to resume reading the original topic:
  • Click the original tab (where you clicked the Where am I in the Workflow? icon).
  • Click its Back button.

Threading Error Analysis
• Inspector may report false positives and false negatives when analyzing applications that call Microsoft Windows* ThreadpoolWait, ThreadpoolTimer, and ThreadpoolIo APIs (first introduced in the Microsoft Windows Vista* operating system) or user-mode scheduling (UMS) APIs (first introduced in the Microsoft Windows 7* operating system).
• If you use Intel® Threading Building Blocks (Intel® TBB), set the macro TBB_USE_THREADING_TOOLS at compilation time to enable correct analysis of Intel® TBB applications. Otherwise the Inspector may generate false positives during threading error analysis. If you use Intel® TBB debug libraries, do one of the following to set the macro TBB_USE_THREADING_TOOLS:
  • Use the /MDd switch to set the _DEBUG preprocessor symbol (recommended).
  • Set the macro TBB_USE_DEBUG.
  If you use Intel® TBB release libraries, set the TBB_USE_THREADING_TOOLS macro. See the Intel® TBB documentation for more information.
• Inspector does not detect deadlocks or potential deadlocks created with:
  • Some types of locks via Intel's C/C++ parallel extension (__critical) provided by the Intel® Parallel Composer
  • Some types of locks in Intel® TBB (spin_mutex, spin_rw_mutex)
  • Synchronization objects with non-exclusive ownership, for example, condition variables, semaphores, and events
• Inspector may not detect threading issues on data accessed in the C runtime library (like memmove and memcpy).
• Inspector does not detect inter-process data races, deadlocks, or potential deadlocks.
• Inspector does not capture the main thread creation site if the .pdb symbol file is not in the location specified within the .exe or .dll executable file, or in the location containing the .exe or .dll executable file.
• Inspector may report false positives for analyzed applications using customized synchronization primitives.

Memory Error Analysis
• On the 64-bit version of the Windows 7* operating system, the Inspector may show incorrect call stacks associated with memory leaks detected by the narrow (mi1) analysis setting. Any stack frames corresponding to functions in libraries/executables that call LoadLibrary() will be missing in call stacks associated with memory leaks. Workaround: Analyze your application using a wider memory analysis setting (mi2 and mi3).
• Inspector does not report memory leaks when using the narrow (mi1) analysis setting if the application under analysis circumvents the normal termination flow and does not call ExitProcess() (which is a call normally made by the runtime library when the application's main function ends). Workaround: Analyze your application using a wider memory analysis setting (mi2 and mi3).
• Inspector does not report memory as leaked if a pointer to the memory is available in the application memory space at the time the application exits, because the application has the ability to free this memory. For example, if an application allocates a block of memory and stores a pointer to the memory in a global variable, this memory is not included in a list of reported memory leaks. Only memory that has no pointer to it is considered a leak.
• Inspector may report false positives when the analyzed application uses custom memory allocators.
• In some circumstances, the Inspector does not record the deallocation of memory freed during application shutdown. For example, the Inspector may not record the event if memory is freed from the destructor of an object that is located in global memory, and that destructor does not execute until late in the shutdown process. Such memory may be reported as a memory leak.
• If the semantics of standard C runtime allocators are changed (the application uses non-standard versions) such that the memory returned by the allocator is initialized, the behavior of the Inspector is unknown and could lead to abnormal analysis termination.
• Inspector may report mismatched allocation/deallocation for an array that appears correct with an allocation of new type[] and a matching delete[] if the code uses #include . This occurs because the underlying implementation brought in by this include file may not actually use a matched deallocation to support backward compatibility.
  Applications that use #include are non-conforming C++ applications. Workaround: Make the code conform by using #include (which eliminates this problem), or suppress the code.
• The narrow memory error analysis setting (mi1) may not report leaks for memory allocated with the operator new from mfc90ud.dll (mfc90u.dll). Workaround: Copy the corresponding pdb-file (mfc90ud.i386.pdb or mfc90ud.AMD64.pdb) from the C:\WINDOWS\symbols\dll directory to the directory where mfc90ud.dll is located.
• The behavior of Memory Leak Analysis level 1 (mi1) is undefined and could lead to abnormal analysis termination if the analyzed application links with the release version of tbbmalloc.dll. Workaround: Use the debug version of tbbmalloc.dll.
• When doing Memory Error Analysis on applications that use fibers or user-level threads, the Inspector may not work properly and/or results may be incorrect in some cases. For such an application, if the "analyze stack accesses" feature is turned on, the application will not work properly and/or data collection will fail. If the "analyze stack accesses" feature is not turned on, then in some cases, incorrect call stacks may be reported. Intel® Cilk™ Plus uses fibers or user-level threads, and as such, this caveat applies to any software that uses Intel® Cilk™ Plus.

Command-line Interface
• Options put in a file and passed to the insp-cl command with the -option-file option cannot use the same syntax alternatives used when entering these options on the command line. The restrictions are as follows:
  • Put a newline character after the final line in the file, otherwise the final character is duplicated.
  • Use '=' between the option name and its value(s)

For more information, please refer to Technical Support.

6 Attributions

wxWindows Library

This tool includes wxWindows software which can be downloaded from http://www.wxwidgets.org/downloads.
wxWindows Library Licence, Version 3.1 ======================================10 Intel® Parallel Inspector 2011 Release Notes Copyright (C) 1998-2005 Julian Smart, Robert Roebling et al Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. WXWINDOWS LIBRARY LICENCE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public Licence as published by the Free Software Foundation; either version 2 of the Licence, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public Licence for more details. You should have received a copy of the GNU Library General Public Licence along with this software, usually in a file named COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA. EXCEPTION NOTICE 1. As a special exception, the copyright holders of this library give permission for additional uses of the text contained in this release of the library as licenced under the wxWindows Library Licence, applying either version 3.1 of the Licence, or (at your option) any later version of the Licence as published by the copyright holders of version 3.1 of the Licence document. 2. The exception is that you may use, copy, link, modify and distribute under your own terms, binary object code versions of works based on the Library. 3. If you copy code from files distributed under the terms of the GNU General Public Licence or the GNU Library General Public Licence into a copy of this library, as this licence permits, the exception does not apply to the code that you add in this way. 
To avoid misleading anyone as to the status of such modified files, you must delete this exception notice from such code and/or adjust the licensing conditions notice accordingly.

4. If you write modifications of your own for this library, it is your choice whether to permit this exception to apply to your modifications. If you do not wish that, you must delete the exception notice from such code and/or adjust the licensing conditions notice accordingly.

Libxml2

Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar license but with different Copyright notices) all the files are:

Copyright (C) 1998-2003 Daniel Veillard. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.
Boost

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NONINFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Apache

Apache License - Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity.
For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner.
For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. 
If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License.
You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NONINFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS 7 Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. 
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.
You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and other countries.

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java and all Java based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.

Copyright © 2009-2011, Intel Corporation. All rights reserved.
Intel® Visual Fortran Composer XE 2011 Getting Started Tutorials

Document Number: 323650-001US
World Wide Web: http://developer.intel.com

Contents

Legal Information
Introducing the Intel® Visual Fortran Composer XE 2011
Prerequisites
Chapter 1: Navigation Quick Start
  Getting Started with the Intel® Visual Fortran Composer XE 2011
  Starting the Intel® Parallel Debugger Extension
Chapter 2: Tutorial: Intel® Fortran Compiler
  Using Auto Vectorization
    Introduction to Auto-vectorization
    Establishing a Performance Baseline
    Generating a Vectorization Report
    Improving Performance by Aligning Data
    Improving Performance with Interprocedural Optimization
    Additional Exercises
  Using Guided Auto-parallelization
    Introduction to Guided Auto-parallelization
    Preparing the Project for Guided Auto-parallelization
    Running Guided Auto-parallelization
    Analyzing Guided Auto-parallelization Reports
    Implementing Guided Auto-parallelization Recommendations
  Using Coarray Fortran
    Introduction to Coarray Fortran
    Compiling the Sample Program
    Controlling the Number of Images

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Centrino, Cilk, Intel, Intel Atom, Intel Core, Intel NetBurst, Itanium, MMX, Pentium, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Portions Copyright (C) 2001, Hewlett-Packard Development Company, L.P.

Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.

Copyright © 2011, Intel Corporation. All rights reserved.

Introducing the Intel® Visual Fortran Composer XE 2011

This guide shows you how to start the Intel® Visual Fortran Composer XE 2011 and begin debugging code using the Intel® Parallel Debugger Extension. The Intel® Visual Fortran Composer XE 2011 is a comprehensive set of software development tools that includes the following components:
• Intel® Fortran Compiler
• Intel® Math Kernel Library
• Intel® Parallel Debugger Extension

Check http://software.intel.com/en-us/articles/intel-software-product-tutorials/ for the following:
• ShowMe video for using Intel® Visual Fortran Composer XE with Microsoft Visual Studio*

Prerequisites

You need the following tools, skills, and knowledge to effectively use these tutorials.

NOTE.
Although the instructions and screen captures in these tutorials refer to the Visual Studio* 2005 integrated development environment (IDE), you can use these tutorials with later versions of Visual Studio.

Required Tools

You need the following tools to use these tutorials:
• Microsoft Visual Studio 2005 or later.
• Intel® Visual Fortran Composer XE 2011.
• Sample code included with the Intel® Visual Fortran Composer XE 2011.

NOTE.
• Samples are non-deterministic. Your results may vary from the examples shown throughout these tutorials.
• Samples are designed only to illustrate features and do not represent best practices for creating multithreaded code.

Required Skills and Knowledge

These tutorials are designed for developers with a basic understanding of Microsoft Visual Studio, including how to:
• open a project/solution.
• access the Document Explorer (valid in Microsoft Visual Studio 2005/2008).
• display the Solution Explorer.
• compile and link a project.
• ensure a project compiled successfully.

Chapter 1: Navigation Quick Start

Getting Started with the Intel® Visual Fortran Composer XE 2011

The Intel® Visual Fortran Composer XE 2011 integrates into the following versions of the Microsoft Visual Studio* Integrated Development Environment (IDE):
• Microsoft Visual Studio 2010*
• Microsoft Visual Studio 2008*
• Microsoft Visual Studio 2005*

If you do not have one of these Microsoft products on your system, the Intel® Visual Fortran Composer XE 2011 installation can install Microsoft Visual Studio 2008 Shell and Libraries*.

To start the Intel® Visual Fortran Compiler XE 12.1 from the Microsoft Visual Studio* IDE, perform the following steps:
1. Launch Microsoft Visual Studio*.
2. Select File > New > Project.
3. In the New Project window select a project type under Intel® Visual Fortran.
4. Select the desired template.
5. Click OK.

Setting Compiler Options

1. Select Project > Properties.
The Property Pages for your solution display.
2. Locate Fortran in the list and expand the heading.
3. Step through the available properties to select your configuration.

The results of the compilation display in the Output window.

Starting the Intel® Parallel Debugger Extension

The Intel® Parallel Debugger Extension for Microsoft Visual Studio* is a debugging add-on for the Intel® Compiler's parallel code development features. It facilitates developing parallelism into applications based on the Intel® OpenMP* runtime environment.

The Intel® Parallel Debugger Extension provides:
• A new Microsoft Visual Studio* toolbar
• An extension to the Microsoft Visual Studio* Debug menu
• A set of new views and dialogs that are invoked from the toolbar or the menu tree

The debugger features include:
• Fortran language support including Fortran 95/90
• Assembler language support
• Access to the registers your application accesses
• Bitfield editor to modify registers
• MMU support

Preparing Applications for Parallel Debugging

You must enable the parallel debug instrumentation with the compiler to enable parallel debugging, such as analyzing shared data or breaking at re-entrant function calls.

To enable the parallel debug instrumentation:
1. Open your application project in Microsoft Visual Studio*.
2. Select Project > Properties... from the menu. The Projectname Property Pages dialog box opens.
3. Enable Parallel debug checking.
   1. Select Configuration Properties > Fortran > Debugging in the left pane.
   2. Under Enable Parallel Debug Checks, select Yes (/debug:parallel).
4. Click OK.
5. Rebuild your application.

Your application is now instrumented for parallel debugging using the features of the Intel® Parallel Debugger Extension.
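As a rough illustration of the kind of code that parallel debug instrumentation targets, the sketch below shows a small OpenMP loop in which one variable is shared between threads and another is combined by a reduction. The program name, variable names, and loop bounds are hypothetical and are not taken from the product samples; the example assumes the project is built with OpenMP support enabled in addition to the /debug:parallel setting described above.

```fortran
program shared_data_sketch
  implicit none
  integer, parameter :: n = 100000   ! hypothetical problem size
  integer :: i
  integer :: counter                 ! shared between all threads
  real(8) :: total                   ! combined across threads by the reduction clause

  counter = 0
  total = 0.0d0

  !$omp parallel do shared(counter) reduction(+:total)
  do i = 1, n
    total = total + real(i, 8)
    if (mod(i, 1000) == 0) then
      ! The atomic directive protects the update of the shared counter.
      !$omp atomic
      counter = counter + 1
    end if
  end do
  !$omp end parallel do

  print *, 'multiples of 1000 seen:', counter
  print *, 'sum of 1..n:', total
end program shared_data_sketch
```

Removing the atomic directive would turn the counter update into an unprotected access to shared data, which is the sort of situation a shared-data analysis is meant to help you locate.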
Chapter 2: Tutorial: Intel® Fortran Compiler

Using Auto Vectorization

Introduction to Auto-vectorization

For the Intel® Fortran Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element at a time, the loop can execute more efficiently. It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.

Using the -vec (Linux* OS) or the /Qvec (Windows* OS) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Vectorization is enabled with the Intel Fortran Compiler at optimization levels of /O2 and higher. Many loops are vectorized automatically, but in cases where this doesn't happen, you may be able to vectorize loops by making simple code modifications. In this tutorial, you will:
• establish a performance baseline
• generate a vectorization report
• improve performance by aligning data
• improve performance using Interprocedural Optimization

Locating the Samples

To begin this tutorial, open the vec_samples.zip archive in the product's Samples directory:
\Samples\\Fortran\vec_samples.zip

Use these files for this tutorial:
• matrix_vector_multiplication_f.sln
• matrix_vector_multiplication_f.vcproj
• driver.f90
• matvec.f90

Open the Microsoft Visual Studio solution file, matrix_vector_multiplication_f.sln, and follow the steps below to prepare the project for the vectorization exercises in this tutorial:
1.
Change the Active solution configuration to Release using Build > Configuration Manager.
2. Clean the solution by selecting Build > Clean Solution.

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, build your project with these settings:
1. Select Project > Properties > Fortran > Optimization > Optimization > Minimum Size (/O1).
2. Select Project > Properties > Fortran > Data > Default Real KIND > 8 (/real_size:64).
3. Rebuild the project, then run the executable (Debug > Start Without Debugging) and record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.

Generating a Vectorization Report

A vectorization report tells you whether the loops in your code were vectorized, and if not, explains why not.

Add the /Qvec-report1 option by selecting Project > Properties > Fortran > Diagnostics > Vectorizer Diagnostic Level > Loops Successfully Vectorized (1) (/Qvec-report1).

Because vectorization is off at /O1, the compiler does not generate a vectorization report, so recompile at /O2 (default optimization):

Select Fortran > Optimization > Optimization > Maximize Speed.

Record the new execution time. The reduction in time is mostly due to auto-vectorization of the inner loop at line 32 noted in the vectorization report:

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.

The /Qvec-report2 option returns a list that also includes loops that were not vectorized, along with the reason why the compiler did not vectorize them. Change /Qvec-report1 to /Qvec-report2.
Also, for Linker > Command Line > Additional Options, add /Qvec-report2:

Rebuild your project.

The vectorization report indicates that the loop at line 33 in matvec.f90 did not vectorize because it is not the innermost loop of the loop nest.

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

NOTE. For more information on the /Qvec-report compiler option, see the Compiler Options section in the Compiler User and Reference Guide.

Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve the vectorizer performance by aligning the arrays a, b, and c in driver.f90 on a 16-byte boundary so the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment.
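As a rough illustration of the alignment idea, the sketch below declares three arrays with a 16-byte alignment attribute and runs a column-oriented matrix-vector product over them. The program name, array sizes, and loop bodies are hypothetical and are not the contents of driver.f90 or matvec.f90. The !dir$ line is an Intel Fortran directive; a compiler that does not recognize it will generally treat it as an ordinary comment, so the program should still build and run, only without the alignment request.

```fortran
program align_sketch
  implicit none
  ! Hypothetical sizes: each column of a holds 64 * 8 = 512 bytes, a multiple
  ! of 16, so every column starts on the same 16-byte alignment as a itself.
  integer, parameter :: rows = 64, cols = 32
  real(8) :: a(rows, cols), b(rows), c(cols)
  !dir$ attributes align : 16 :: a, b, c
  integer :: i, j

  a = 1.0d0
  c = 2.0d0
  b = 0.0d0

  do j = 1, cols              ! outer loop over columns: not the innermost loop
    do i = 1, rows            ! inner loop: unit stride down a column
      b(i) = b(i) + a(i, j) * c(j)
    end do
  end do

  print *, 'b(1) =', b(1)
end program align_sketch
```

The inner loop walks down one column at a time, which is the contiguous direction for Fortran arrays, and the row count is chosen so that no padding is needed to keep every column aligned.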
Using the ALIGNED macro will insert an alignment directive for a, b, and c in driver.f90 with the following syntax:

!dir$ attributes align : 16 :: a,b,c

This instructs the compiler to create arrays that are aligned on a 16-byte boundary, which should facilitate the use of SSE aligned load instructions.

In addition, the column height of the matrix a needs to be padded out to be a multiple of 16 bytes, so that each individual column of a maintains the same 16-byte alignment. In practice, maintaining a constant alignment between columns is much more important than aligning the start of the arrays.

To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in matvec.f90 are aligned by using the directive

!dir$ vector aligned

NOTE. If you use !dir$ vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if !dir$ vector aligned is not used. See the code under the ALIGNED macro in matvec.f90.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, !dir$ vector aligned advises the compiler that the data is 32-byte aligned.

Add the ALIGNED preprocessor definition to ensure consistently aligned data (Fortran > Preprocessor > Preprocessor Definitions), then rebuild your project.

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Improving Performance with Interprocedural Optimization

The compiler may be able to perform additional optimizations if it is able to optimize across source file boundaries. These may include, but are not limited to, function inlining. This is enabled with the /Qipo option.

Rebuild the program using the /Qipo option to enable interprocedural optimization: select Optimization > Interprocedural Optimization > Multi-file (/Qipo).

Note that the vectorization messages now appear at the point of inlining in driver.f90 (line 70).

driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
driver.f90(73) (col.
16): remark: loop was not vectorized: not inner loop.
driver.f90(70) (col. 14): remark: LOOP WAS VECTORIZED.
driver.f90(70) (col. 14): remark: loop was not vectorized: not inner loop.
driver.f90(70) (col. 14): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.

Now, run the executable and record the execution time.

Additional Exercises

The previous examples made use of double precision arrays. They may be built instead with single precision arrays by changing the command-line option /real-size:64 to /real-size:32.

The non-vectorized versions of the loop execute only slightly faster than the double precision versions; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.

NOTE. In the example with data alignment, you will need to set ROWBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, the directive !dir$ vector aligned will cause the program to fail.

This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.

Using Guided Auto-parallelization

Introduction to Guided Auto-parallelization

Guided Auto-parallelization (GAP) is a feature of the Intel® Fortran Compiler that offers selective advice and, when correctly applied, results in auto-vectorization or auto-parallelization for serially-coded applications. Using the /Qguide option with your normal compiler options at /O2 or higher is sufficient to enable the GAP technology to generate the advice for auto-vectorization. Using /Qguide in conjunction with /Qparallel will enable the compiler to generate advice for auto-parallelization.

In this tutorial, you will:
1. prepare the project for Guided Auto-parallelization.
2. run Guided Auto-parallelization.
3.
analyze Guided Auto-parallelization reports.
4. implement Guided Auto-parallelization recommendations.

Preparing the Project for Guided Auto-parallelization

To begin this tutorial, open the GuidedAutoParallel.zip archive in the product's Samples directory at:

\Samples\\Fortran\

The following Visual Studio* 2005 project files and source files are included:
• GAP-f.sln
• GAP-f.vfproj
• main.f90
• scalar_dep.f90

Open the Microsoft Visual Studio Solution file, GAP-f.sln, and follow the steps below to prepare the project for Guided Auto-parallelization (GAP).
1. Clean the Solution by selecting Build > Clean Solution.
2. Since GAP is enabled only with option /O2 or higher, you will need to change the build configuration to Release using Build > Configuration Manager.

Running Guided Auto-parallelization

There are several ways to run GAP analysis in Visual Studio, depending on whether you want analysis for the whole solution, the project, a single file, a function, or a range of lines in your source code. In this tutorial, we will use single-file analysis. Follow the steps below to run a single-file analysis on scalar_dep.f90 in the GAP-f project:
1. In the GAP-f project, right-click on scalar_dep.f90.
2. Select Intel Visual Fortran Composer XE > Guided Auto Parallelism > Run Analysis on file "scalar_dep.f90".
3. If the /Qipo option is enabled, the Analysis with Multi-file optimization dialog appears. Click Run Analysis.
4. On the Configure Analysis dialog, click Run Analysis using the choices shown here:

NOTE. If you select Send remarks to a file, GAP messages will not be available in the Output window or Error List window.

See the GAP Report in the Output window. GAP reports in the standard Output window are encapsulated with GAP REPORT LOG OPENED and END OF GAP REPORT LOG.
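Because the report is bracketed by two marker strings, it can be separated from the rest of a saved build log with a few lines of script. The following is a minimal sketch in Python, assuming the markers appear somewhere on their own lines in the log as described above. The function name and the sample log lines are hypothetical placeholders, not real compiler output.

```python
def extract_gap_reports(log_text,
                        start_marker="GAP REPORT LOG OPENED",
                        end_marker="END OF GAP REPORT LOG"):
    """Return a list of reports, each a list of the lines found between the two markers."""
    reports = []
    current = None  # None means "not inside a report"
    for line in log_text.splitlines():
        if start_marker in line:
            current = []
        elif end_marker in line:
            if current is not None:
                reports.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return reports

# Placeholder log text; the remark lines are invented for illustration only.
sample_log = """\
Compiling scalar_dep.f90
GAP REPORT LOG OPENED
placeholder remark line 1
placeholder remark line 2
END OF GAP REPORT LOG
Build finished
"""

for report in extract_gap_reports(sample_log):
    print(len(report), "report lines")
    for line in report:
        print("  ", line)
```

Matching on substrings rather than whole lines keeps the sketch tolerant of any decoration the build output may add around the marker text.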
Also, see the GAP Messages in the Error List window:

Analyzing Guided Auto-parallelization Reports

Analyze the output generated by GAP analysis and determine whether or not the specific suggestions are appropriate for the specified source code. For this sample tutorial, GAP generates output for the loop in scalar_dep.f90:

    do i = 1, n
      if (a(i) >= 0) then
        t = i
      end if
      if (a(i) > 0) then
        a(i) = t * (1 / (a(i) * a(i)))
      end if
    end do

In this example, the GAP Report generates a recommendation (remark #30761) to add the /Qparallel option to improve auto-parallelization. Remark #30515 indicates that if variable t can be unconditionally assigned, the compiler will be able to vectorize the loop.

Implementing Guided Auto-parallelization Recommendations

The GAP Report in this example recommends using the /Qparallel option to enable parallelization. Follow these steps to enable this option:
1. Right-click on the GAP-f project and select Properties.
2. On the Property Pages dialog, expand the Fortran heading and select Optimization.
3. In the right-hand pane, select Parallelization, then choose Yes (/Qparallel) and click OK.

Now, run the GAP Analysis again and review the GAP Report:

Apply the necessary changes after verifying that the GAP recommendations are appropriate and do not change the semantics of the program. For this loop, the conditional compilation enables parallelization and vectorization of the loop as recommended by GAP:

    do i = 1, n
    !dir$ if defined(test_gap)
      t = i
    !dir$ else
      if (a(i) >= 0) then
        t = i
      end if
    !dir$ endif
      if (a(i) > 0) then
        a(i) = t * (1 / (a(i) * a(i)))
      end if
    end do

To verify that the loop is parallelized and vectorized:
1. Add the options /Qdiag-enable:par /Qdiag-enable:vec to the Command Line > Additional Options dialog.
2. Add the preprocessor definition test_gap to compile the appropriate code path.
3. Rebuild the GAP-f project and note the reports in the output window:

For more information on using the -guide, -vec-report, and -par-report compiler options, see the Compiler Options section in the Compiler User Guide and Reference.

This completes the tutorial for Guided Auto-parallelization, where you have seen how the compiler can guide you to an optimized solution through auto-parallelization.

Using Coarray Fortran

Introduction to Coarray Fortran

The Intel® Fortran Compiler XE supports parallel programming using coarrays as defined in the Fortran 2008 standard. As an extension to the Fortran language, coarrays offer one method to use Fortran as a robust and efficient parallel programming language. Coarray Fortran uses a single-program, multiple-data (SPMD) programming model.

Coarrays are supported in the Intel® Fortran Compiler XE for Linux* and Intel® Visual Fortran Compiler XE for Windows*.

This tutorial demonstrates how to compile a simple coarray Fortran application using the Intel Fortran Compiler XE, and how to control the number of images (processes) for the application.

Locating the Sample

To begin this tutorial, locate the source file in the product's Samples directory:

\Samples\\Fortran\coarray_samples.zip

Extract the Visual Studio project files from the .zip archive to a working directory:
• coarray_samples.sln
• coarray_samples.vfproj
• hello_image.f90

NOTE. The Intel Fortran Compiler implementation of coarrays follows the standard provided in a draft version of the Fortran 2008 Standard. Not all features present in the Fortran 2008 Standard may be implemented by Intel. Consult the Release Notes for a list of supported features.

Compiling the Sample Program

The hello_image.f90 sample is a hello world program.
Unlike the usual hello world, this coarray Fortran program will spawn multiple images, or processes, that will run concurrently on the host computer. Examining the source code for this application shows a simple Fortran program:

    program hello_image
      write(*,*) "Hello from image ", this_image(), &
                 "out of ", num_images()," total images"
    end program hello_image

Note the function calls to this_image() and num_images(). These are new Fortran 2008 intrinsic functions. The num_images() function returns the total number of images or processes spawned for this program. The this_image() function returns a unique identifier for each image in the range 1 to N, where N is the total number of images created for this program.

After installing the Intel® Visual Fortran Composer XE 2011, start Microsoft Visual Studio* and open the coarray_samples.sln file. To build the project using coarrays, select:

Project > Properties > Fortran > Command Line > /Qcoarray

Now, build the solution (Build > Build Solution), then run the executable (Debug > Start Without Debugging). Your output should be similar to this:

Hello from image 1 out of 8 total images
Hello from image 6 out of 8 total images
Hello from image 7 out of 8 total images
Hello from image 2 out of 8 total images
Hello from image 5 out of 8 total images
Hello from image 8 out of 8 total images
Hello from image 3 out of 8 total images
Hello from image 4 out of 8 total images

By default, when a Coarray Fortran application is compiled with the Intel Fortran Compiler, the invocation creates as many images as there are processor cores on the host platform. The example shown above was run on a dual quad-core host system with eight total cores. As shown, each image is a separately spawned process on the system and executes asynchronously.

NOTE. The /Qcoarray option cannot be used in conjunction with /Qopenmp options.
One cannot mix Coarray Fortran language extensions with OpenMP extensions.

Controlling the Number of Images

There are two methods to control the number of images created for a Coarray Fortran application.

First, you can use the /Qcoarray-num-images=N compiler option to compile the application, where N is the number of images. This option sets the number of images created for the application during run time. For example, use the /Qcoarray-num-images=2 option to limit the number of images of the hello_image.f90 program to exactly two. To use the /Qcoarray-num-images=N option, select:

Project > Properties > Fortran > Command Line > /Qcoarray-num-images=N

In this example, we use /Qcoarray-num-images=2 to generate the following output:

Hello from image 2 out of 2 total images
Hello from image 1 out of 2 total images

The second way to control the number of images is to use the environment variable FOR_COARRAY_NUM_IMAGES, setting this to the number of images you want to spawn. As an example, recompile hello_image.f90 without the /Qcoarray-num-images option. Before running the executable, set the environment variable FOR_COARRAY_NUM_IMAGES to the number of images you want created during the program run.

To set an environment variable in Visual Studio, select Project Properties > Configuration Properties > Debugging > Environment. Then set FOR_COARRAY_NUM_IMAGES=N, where N is the number of images you want to create at runtime. For example, with FOR_COARRAY_NUM_IMAGES=3 the output is similar to this:

Hello from image 3 out of 3 total images
Hello from image 2 out of 3 total images
Hello from image 1 out of 3 total images

NOTE. Setting FOR_COARRAY_NUM_IMAGES=N overrides the /Qcoarray-num-images compiler option.
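The three sources for the image count described in this section form a simple precedence order: the environment variable first, then the compile-time option, then the number of processor cores. The following Python sketch models that order as a plain function. It illustrates the documented behavior only and is not how the runtime is implemented; the function name and its parameters are hypothetical.

```python
import os

def effective_num_images(env, compiled_num_images=None, core_count=None):
    """Model of the image-count precedence described in this tutorial.

    env                 -- mapping of environment variables (for example os.environ)
    compiled_num_images -- N from /Qcoarray-num-images=N, or None if the option was not used
    core_count          -- number of processor cores, or None to ask the operating system
    """
    value = env.get("FOR_COARRAY_NUM_IMAGES")
    if value:
        # The environment variable overrides the compile-time option.
        return int(value)
    if compiled_num_images is not None:
        return compiled_num_images
    if core_count is not None:
        return core_count
    # Assumption: os.cpu_count() reports logical processors, which may
    # differ from the physical core count the tutorial refers to.
    return os.cpu_count() or 1

print(effective_num_images({"FOR_COARRAY_NUM_IMAGES": "3"}, compiled_num_images=2, core_count=8))  # 3
print(effective_num_images({}, compiled_num_images=2, core_count=8))  # 2
print(effective_num_images({}, core_count=8))  # 8
```

Setting the count through the environment is convenient for experiments, because the same executable can be rerun with different image counts without rebuilding.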
Document Number: XXXXXX

Intel® Rapid Storage Technology User Guide

August 2011
Revision 1.0

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The Intel® Rapid Storage Technology may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Intel, Intel® Rapid Storage Technology, Intel® Matrix Storage Technology, Intel® Rapid Recover Technology, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands may be claimed as the property of others.

Copyright © 2011, Intel Corporation.
All rights reserved.

Contents

1 Introduction
    1.1 Terminology
2 Intel® Rapid Storage Technology Features
    2.1 Feature Overview
    2.2 RAID 0 (Striping)
    2.3 RAID 1 (Mirroring)
    2.4 RAID 5 (Striping with Parity)
    2.5 RAID 10
    2.6 Matrix RAID
    2.7 RAID Migration
    2.8 RAID Level Migration
    2.9 Intel® Rapid Recover Technology
    2.10 Advanced Host Controller Interface
        2.10.1 Native Command Queuing
        2.10.2 Hot Plug
3 RAID BIOS Configuration
    3.1 Overview
    3.2 Enabling RAID in BIOS
4 Intel® Rapid Storage Technology Option ROM
    4.1 Overview
    4.2 User Interface
    4.3 Version Identification
    4.4 RAID Volume Creation
5 Loading Driver during Operating System Installation
    5.1 Overview
    5.2 F6 Installation Method
        5.2.1 Automatic F6 Diskette Creation
        5.2.2 Manual F6 Diskette Creation
        5.2.3 F6 Installation Steps
6 Intel® Rapid Storage Technology Installation
    6.1 Overview
    6.2 Where to Obtain the Software
    6.3 Installation Steps
    6.4 Confirming Software Installation
    6.5 Version Identification
7 RAID-Ready Setup
    7.1 Overview
    7.2 System Requirements
    7.3 RAID-Ready System Setup Steps
8 Converting RAID-Ready to Full RAID
    8.1 Overview
    8.2 RAID-Ready to 2-drive RAID 1
9 Verify and Repair
    9.1 Overview
    9.2 Actions during Verify and Repair
Appendix A: Error Messages
    A.1 Incompatible Hardware
    A.2 Operating System Not Supported
    A.3 Source Hard Drive Cannot Be Larger
    A.4 Hard Drive Has System Files
    A.5 Source Hard Drive is Dynamic Disk

1 Introduction

The purpose of this document is to enable a user to properly set up and configure a system using Intel® Rapid Storage Technology. It provides steps for set up and configuration, as well as a brief overview on Intel® Rapid Storage Technology features.

The information in this document is relevant only on systems with a supported Intel chipset and a supported operating system. Supported Intel chipset and operating system information is available at the Intel® Rapid Storage Technology support web page.
Note: The majority of the information in this document is related to either software configuration or hardware integration. Intel is not responsible for the software written by third party vendors or the implementation of Intel components in the products of third party manufacturers. Customers should always contact the place of purchase or system/software manufacturer with support questions about their specific hardware or software configuration.

1.1 Terminology

Term
    Description

AHCI
    Advanced Host Controller Interface: an interface specification that allows the storage driver to enable advanced Serial ATA features such as Native Command Queuing, native hot plug, and power management.

Continuous Update Policy
    When a recovery volume is using this policy, data on the master drive is copied to the recovery drive automatically as long as both drives are connected to the system.

Intel® Rapid Storage Technology Option ROM
    A code module built into the system BIOS that provides boot support for RAID volumes as well as a user interface for configuring and managing RAID volumes.

Master Drive
    The hard drive that is the designated source drive in a recovery volume.

Matrix RAID
    Two independent RAID volumes within a single RAID array.

Member
    A hard drive used within a RAID array.

Migration
    The process of converting a system's data storage configuration from a non-RAID configuration (pass-thru) to a RAID configuration.

Hot Plug
    The unannounced removal and insertion of a Serial ATA hard drive while the system is powered on.

NCQ
    Native Command Queuing: a command protocol in Serial ATA that allows multiple commands to be outstanding within a hard drive at the same time. The commands are dynamically reordered to increase hard drive performance.

On Request Update Policy
    When a recovery volume is using this policy, data on the master drive is copied to the recovery drive when you request it. Only changes since the last update process are copied.
OS
    Operating System

Port0
    A Serial ATA port (connector) on a motherboard identified as Port0.

Port1
    A Serial ATA port (connector) on a motherboard identified as Port1.

Port2
    A Serial ATA port (connector) on a motherboard identified as Port2.

Port3
    A Serial ATA port (connector) on a motherboard identified as Port3.

POST
    Power-On Self Test

RAID
    Redundant Array of Independent Drives: allows data to be distributed across multiple hard drives to provide data redundancy or to enhance data storage performance.

RAID 0 (striping)
    The data in the RAID volume is striped across the array's members. Striping divides data into units and distributes those units across the members, which improves read/write performance but creates no data redundancy.

RAID 1 (mirroring)
    The data in the RAID volume is mirrored across the RAID array's members. Mirroring is the term used to describe the key feature of RAID 1, which writes duplicate data to each member, thereby creating data redundancy and increasing fault tolerance.

RAID 5 (striping with parity)
    The data in the RAID volume and parity are striped across the array's members. Parity information is written with the data in a rotating sequence across the members of the array. This RAID level is a preferred configuration for efficiency, fault-tolerance, and performance.

RAID 10 (striping and mirroring)
    The RAID level where information is striped across a two-disk array for system performance. Each of the drives in the array has a mirror for fault tolerance. RAID 10 provides the performance benefits of RAID 0 and the redundancy of RAID 1. However, it requires four hard drives.

RAID Array
    A logical grouping of physical hard drives.

RAID Level Migration
    The process of converting a system's data storage configuration from one RAID level to another.

RAID Volume
    A fixed amount of space across a RAID array that appears as a single physical hard drive to the operating system.
    Each RAID volume is created with a specific RAID level to provide data redundancy or to enhance data storage performance.

Recovery Drive
    The hard drive that is the designated target drive in a recovery volume.

Recovery Volume
    A volume utilizing Intel® Rapid Recover Technology.

2 Intel® Rapid Storage Technology Features

2.1 Feature Overview

The Intel® Rapid Storage Technology software package provides high-performance Serial ATA (SATA) and SATA RAID capabilities for supported operating systems.

The key features of the Intel® Rapid Storage Technology are as follows:
• RAID 0
• RAID 1
• RAID 5
• RAID 10
• Matrix RAID
• RAID migration and RAID level migration
• Intel® Rapid Recover Technology
• Advanced Host Controller Interface (AHCI) support

2.2 RAID 0 (Striping)

RAID 0 uses the read/write capabilities of two or more hard drives working in unison to maximize the storage performance of a computer system.

The following table provides an overview of the advantages, the level of fault-tolerance provided and the typical usage of RAID 0.

RAID 0 Overview
Hard Drives Required: 2-6
Advantage: Highest transfer rates
Fault-tolerance: None - if one disk fails all data will be lost
Application: Typically used in desktops and workstations for maximum performance for temporary data and high I/O rate. 2-drive RAID 0 available in specific mobile configurations.

2.3 RAID 1 (Mirroring)

A RAID 1 array contains two hard drives where the data between the two is mirrored in real time to provide good data reliability in the case of a single disk failure; when one disk drive fails, all data is immediately available on the other without any impact to the integrity of the data.

The following table provides an overview of the advantages, the level of fault-tolerance provided and the typical usage of RAID 1.

RAID 1 Overview
Hard Drives Required: 2
Advantage: 100% redundancy of data. One disk may fail, but data will continue to be accessible.
A rebuild to a new disk is recommended to maintain data redundancy.
Fault-tolerance: Excellent - disk mirroring means that all data on one disk is duplicated on another disk.
Application: Typically used for smaller systems where capacity of one disk is sufficient and for any application(s) requiring very high availability. Available in specific mobile configurations.

2.4 RAID 5 (Striping with Parity)

A RAID 5 array contains three or more hard drives where the data and parity are striped across all the hard drives in the array. Parity is a mathematical method for recreating data that was lost from a single drive, which increases fault-tolerance.

The following table provides an overview of the advantages, the level of fault-tolerance provided and the typical usage of RAID 5.

RAID 5 Overview
Hard Drives Required: 3-6
Advantage: Higher percentage of usable capacity and high read performance as well as fault-tolerance.
Fault-tolerance: Excellent - parity information allows data to be rebuilt after replacing a failed hard drive with a new drive.
Application: Storage of large amounts of critical data. Not available in mobile configurations.

2.5 RAID 10

A RAID 10 array uses four hard drives to create a combination of RAID levels 0 and 1. It is a striped set whose members are each a mirrored set.

The following table provides an overview of the advantages, the level of fault-tolerance provided and the typical usage of RAID 10.

RAID 10 Overview
Hard Drives Required: 4
Advantage: Combines the read performance of RAID 0 with the fault-tolerance of RAID 1.
Fault-tolerance: Excellent - disk mirroring means that all data on one disk is duplicated on another disk.
Application: High-performance applications requiring data protection, such as video editing. Not available in mobile configurations.

2.6 Matrix RAID

Matrix RAID allows you to create two RAID volumes on a single RAID array.
As an example, on a system with an Intel® 82801GR I/O controller hub (ICH7R), Intel® Rapid Storage Technology allows you to create both a RAID 0 volume and a RAID 5 volume across four Serial ATA hard drives.
[Figure: Example of Matrix RAID]

2.7 RAID Migration
The RAID migration feature enables a properly configured PC, known as a RAID-Ready system, to be converted into a high-performance RAID 0, RAID 1, RAID 5, or RAID 10 configuration by adding one or more Serial ATA hard drives to the system and invoking the RAID migration process from within Windows. The following RAID migrations are supported:
• RAID-Ready to 2, 3, 4, 5 or 6-drive RAID 0
• RAID-Ready to 2-drive RAID 1
• RAID-Ready to 3, 4, 5 or 6-drive RAID 5
• RAID-Ready to 4-drive RAID 10
Note: Not all migrations may be available, as each migration is supported only on specific platform configurations.
The migrations do not require re-installation of the operating system. All applications and data remain intact. Refer to Supported RAID Migrations for more information on migrations and the platforms on which each migration is supported.

2.8 RAID Level Migration
The RAID level migration feature enables a user to migrate data from a RAID 0, RAID 1, or RAID 10 volume to RAID 5 by adding any additional Serial ATA hard drives necessary and invoking the modify volume process from within Windows. The following RAID level migrations are supported:
• 2-drive RAID 0 to 3, 4, 5 or 6-drive RAID 5
• 3-drive RAID 0 to 4, 5 or 6-drive RAID 5
• 4-drive RAID 0 to 5 or 6-drive RAID 5
• 2-drive RAID 1 to 3, 4, 5 or 6-drive RAID 5
• 4-drive RAID 10 to 4, 5 or 6-drive RAID 5
Note: Not all migrations may be available, as each migration is supported only on specific platform configurations.
RAID level migrations do not require re-installation of the operating system. All applications and data remain intact.
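The migration lists in Sections 2.7 and 2.8 can be treated as simple lookup tables. The sketch below, in Python, transcribes them and checks whether a planned migration appears in the lists. The table and function names are hypothetical, and the check says nothing about platform support, which the notes above make a separate condition.

```python
# Lookup tables transcribed from the bulleted lists in Sections 2.7 and 2.8.
# Names are illustrative only; they are not part of any Intel software.

# RAID-Ready (single drive) to a RAID volume: target level -> allowed drive counts
RAID_READY_MIGRATIONS = {
    "RAID 0": {2, 3, 4, 5, 6},
    "RAID 1": {2},
    "RAID 5": {3, 4, 5, 6},
    "RAID 10": {4},
}

# Existing volume to RAID 5: (source level, source drives) -> allowed RAID 5 drive counts
RAID5_LEVEL_MIGRATIONS = {
    ("RAID 0", 2): {3, 4, 5, 6},
    ("RAID 0", 3): {4, 5, 6},
    ("RAID 0", 4): {5, 6},
    ("RAID 1", 2): {3, 4, 5, 6},
    ("RAID 10", 4): {4, 5, 6},
}


def raid_ready_migration_listed(target_level: str, target_drives: int) -> bool:
    """True if RAID-Ready -> target appears in the Section 2.7 list."""
    return target_drives in RAID_READY_MIGRATIONS.get(target_level, set())


def raid5_level_migration_listed(source_level: str, source_drives: int,
                                 target_drives: int) -> bool:
    """True if source -> RAID 5 appears in the Section 2.8 list."""
    allowed = RAID5_LEVEL_MIGRATIONS.get((source_level, source_drives), set())
    return target_drives in allowed


print(raid_ready_migration_listed("RAID 1", 2))        # listed
print(raid5_level_migration_listed("RAID 0", 4, 4))    # not listed
```

A listed migration may still be unavailable on a given platform, so a real tool would need a second check against the platform's supported configurations.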
Refer to Supported RAID Migrations for more information on migrations and the platforms on which each migration is supported.

2.9 Intel® Rapid Recover Technology
Intel® Rapid Recover Technology utilizes RAID 1 (mirroring) functionality to copy data from a designated master drive to a designated recovery drive. The master drive data can be copied to the recovery drive either continuously or on request. When using the continuous update policy, changes made to the data on the master drive while the system is not docked are automatically copied to the recovery drive when the system is re-docked. When using the on request update policy, the master drive data can be restored to a previous state by copying the data on the recovery drive back to the master drive. The following table provides an overview of the advantages, the disadvantages and the typical usage of Intel® Rapid Recover Technology.
Recovery Volume Overview
Hard Drives Required: 2
Advantage: More control over how data is copied between master and recovery drives; fast volume updates (only changes to the master drive since the last update are copied to the recovery drive); member hard drive data can be viewed in Microsoft Windows Explorer*.
Disadvantage: No increase in volume capacity.
Application: Critical data protection for mobile systems; fast restoration of the master drive to a previous or default state.

2.10 Advanced Host Controller Interface
Advanced Host Controller Interface (AHCI) is an interface specification that allows the storage driver to enable advanced SATA features such as Native Command Queuing and Native Hot Plug. Refer to Supported Chipsets for AHCI for more information.

2.10.1 Native Command Queuing
Native Command Queuing (NCQ) is a feature supported by AHCI that allows SATA hard drives to accept more than one command at a time.
NCQ, when used in conjunction with one or more hard drives that support NCQ, increases storage performance on random workloads by allowing the drive to internally optimize the order of commands.
Note: To take advantage of NCQ, you need the following:
• Chipset that supports AHCI
• Intel® Rapid Storage Technology
• One or more SATA hard drives that support NCQ

2.10.2 Hot Plug
Hot plug, also referred to as hot swap, is a feature supported by AHCI that allows SATA hard drives to be removed or inserted while the system is powered on and running. As an example, hot plug may be used to replace a failed hard drive that is in an externally-accessible drive enclosure.
Note: To take advantage of hot plug, you need the following:
• Chipset that supports AHCI
• Intel® Rapid Storage Technology
• Hot plug capability correctly enabled in the system BIOS by the motherboard manufacturer

3 RAID BIOS Configuration

3.1 Overview
To install the Intel® Rapid Storage Technology, the system BIOS must include the SATA RAID option ROM and you must enable RAID in the BIOS.

3.2 Enabling RAID in BIOS
Note: The instructions to enable RAID in the BIOS are specific to motherboards manufactured by Intel with a supported Intel chipset. The specific BIOS settings on non-Intel motherboards may differ. Refer to the motherboard documentation or contact the motherboard manufacturer or your place of purchase for specific instructions. Always follow the instructions that are provided with your motherboard.
Depending on your Intel motherboard model, enable RAID by following either of the procedures below.
1. Press the F2 key after the Power-On Self Test (POST) memory test begins.
2. Select the Configuration menu, then the SATA Drives menu.
3. Set the Chipset SATA Mode to RAID.
4. Press the F10 key to save the BIOS settings and exit the BIOS Setup program.
OR
1. Press the F2 key after the Power-On Self Test (POST) memory test begins.
2. Select the Advanced menu, then the Drive Configuration menu.
3.
Set the Drive Mode option to Enhanced.
4. Enable Intel® RAID Technology.
5. Press the F10 key to save the BIOS settings and exit the BIOS Setup program.

4 Intel® Rapid Storage Technology Option ROM

4.1 Overview
The Intel® Rapid Storage Technology option ROM provides the following:
• Pre-operating system user interface for RAID volume management
• Ability to create, delete and reset RAID volumes
• RAID recovery

4.2 User Interface
To enter the Intel® Rapid Storage Technology option ROM user interface, press Ctrl-I when prompted during the Power-On Self Test (POST).
[Figure: Option ROM prompt]
In the user interface, the hard drive(s) and hard drive information listed for your system will differ from the example in Figure 3.
[Figure: Option ROM user interface]

4.3 Version Identification
To identify the version of the Intel® Rapid Storage Technology option ROM in the system BIOS, enter the option ROM user interface. The version number is located in the upper right corner.

4.4 RAID Volume Creation
Use the following steps to create a RAID volume using the Intel® Rapid Storage Technology user interface:
Note: The following procedure should only be used with a newly-built system or if you are reinstalling your operating system. The following procedure should not be used to migrate an existing system to RAID 0. If you wish to create matrix RAID volumes after the operating system software is loaded, they should be created using the Intel® Rapid Storage Technology software in Windows.
1. Press Ctrl-I when the following window appears during POST:
2. Select the Create RAID Volume option and press Enter.
3. Type in a volume name and press Enter, or press Enter to accept the default volume name.
4. Select the RAID level by using the up and down arrow keys to scroll through the available values, then press Enter.
5. Press Enter to select the physical disks. A dialog similar to the following will appear:
6.
Select the appropriate number of hard drives by using the up and down arrow keys to scroll through the list of available hard drives. Press the Space bar to select a drive. When you have finished selecting hard drives, press Enter.
7. Unless you have selected RAID 1, select the strip size by using the up and down arrow keys to scroll through the available values and then press Enter.
8. Select the volume capacity and press Enter.
Note: The default value indicates the maximum volume capacity using the selected disks. If less than the maximum volume capacity is chosen, creation of a second volume is needed to utilize the remaining space (i.e. a matrix RAID configuration).
9. At the Create Volume prompt, press Enter to create the volume. The following prompt will appear:
10. Press the key to confirm volume creation.
11. Exit the option ROM user interface by selecting the Exit option.
12. Press the key again to confirm exit.
Note: To change any of the information before the volume creation has been confirmed, you must exit the Create Volume process and restart it. Press the key to exit the Create Volume process.

5 Loading Driver during Operating System Installation

5.1 Overview
The chart below shows the circumstances in which the F6 installation method must be used during an operating system installation.
Operating system    Total drive volume           F6 installation method
Windows 7*          Less than 2 Terabytes        Recommended but not required (1)
Windows 7*          More than 2 Terabytes (2)    Required
Windows Vista*      Less than 2 Terabytes        Recommended but not required (1)
Windows Vista*      More than 2 Terabytes (2)    Required
Windows XP*         Less than 2 Terabytes        Required
Windows XP*         More than 2 Terabytes (2)    Required
(1) Windows 7 and Windows Vista both include drivers for RAID/AHCI during installation.
(2) For Intel® Desktop Boards, you must first enable UEFI in the BIOS when using total drive volume greater than two Terabytes. For non-Intel motherboards, refer to the motherboard documentation to see if this is a requirement.
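The chart above amounts to a small decision table, which can be sketched in Python as follows. The function and constant names are hypothetical. The chart does not say what applies at exactly 2 Terabytes, so the sketch treats that case as "Required"; that is an assumption, not something stated in this guide.

```python
# Decision table for the F6 installation method, transcribed from Section 5.1.
# Assumption: a total drive volume of exactly 2 TB is treated like "more than 2 TB".

OS_WITH_INBOX_DRIVERS = {"Windows 7", "Windows Vista"}  # include RAID/AHCI drivers
OS_WITHOUT_INBOX_DRIVERS = {"Windows XP"}


def f6_requirement(os_name: str, total_volume_tb: float) -> str:
    """Return the chart entry for an operating system and total drive volume."""
    if os_name in OS_WITHOUT_INBOX_DRIVERS:
        return "Required"                       # Windows XP: required at any size
    if os_name in OS_WITH_INBOX_DRIVERS:
        if total_volume_tb < 2.0:
            return "Recommended but not required"
        return "Required"                       # large volumes need the F6 driver
    raise ValueError(f"operating system not covered by the chart: {os_name}")


for os_name, size_tb in [("Windows 7", 1.0), ("Windows Vista", 3.0), ("Windows XP", 0.5)]:
    print(os_name, size_tb, "->", f6_requirement(os_name, size_tb))
```

The sketch does not model the UEFI requirement from footnote 2, which depends on the motherboard rather than on the operating system.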
5.2 F6 Installation Method
The F6 installation method requires a 3.5" diskette with the driver files.

5.2.1 Automatic F6 Diskette Creation
To automatically create a diskette that contains the files needed during the F6 installation process, follow these steps:
1. Download the latest F6 Driver Diskette utility from Download Center: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101
2. Run the .EXE file.
3. Follow all on-screen prompts.
Note: Choose either the 32-bit or the 64-bit version, depending on your operating system.

5.2.2 Manual F6 Diskette Creation
To manually create a diskette that contains the files needed during the F6 installation process, follow these steps:
1. Download the Intel® Rapid Storage Technology and save it to your local drive (or use the CD shipped with your motherboard which contains the Intel® Rapid Storage Technology).
Note: The Intel® Rapid Storage Technology can be downloaded from Download Center at http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101
2. Extract the driver files at the command prompt by running the following command:
{filename} -A -P {path}
Example: IATA_CD_10.6.0.1022.EXE -A -P C:\TEMP
3. The following directory structure will be created:
\Drivers
    \x32
    \x64
4. Copy the IAAHCI.CAT, IAAHCI.INF, IASTOR.CAT, IASTOR.INF, IASTOR.SYS, and TXTSETUP.OEM files to the root directory of a diskette.
Note: If the system has a 32-bit processor, copy the files found in the \x32 folder; if the system has a 64-bit processor, copy the files found in the \x64 folder.

5.2.3 F6 Installation Steps
To install the Intel® Rapid Storage Technology driver using the F6 installation method, complete the following steps:
1. Press F6 at the beginning of Windows setup when prompted in the status line with the "Press F6 if you need to install a third party SCSI or RAID driver" message.
2.
After pressing F6, nothing will happen immediately; setup will temporarily continue loading drivers and then you will be prompted with a screen to load support for mass storage device(s). Press S to "Specify Additional Device".
3. Insert the diskette containing the driver files and press the Enter key. Refer to the Automatic F6 Diskette Creation section above for instructions.
4. Select the RAID or AHCI controller entry that corresponds to your BIOS setup and press Enter.
Note: Not all available selections may appear in the list; use the up and down arrow keys to see additional options.
5. Press Enter to confirm.
Windows setup will now continue. Leave the diskette in the diskette drive until the system reboots itself because Windows setup will need to copy the files again from the diskette. After Windows setup has copied these files again, remove the diskette so that Windows setup can reboot as needed.

6 Intel® Rapid Storage Technology Installation

6.1 Overview
After installing an operating system onto a RAID volume or on a SATA hard drive when in RAID or AHCI mode, the Intel® Rapid Storage Technology can be loaded from within Windows. This installs the following components:
• User interface (i.e. Intel® Rapid Storage Technology software)
• Tray icon service
• Monitor service, allowing you to monitor the health of your RAID volume and/or hard drives.
Warning: The Intel® Rapid Storage Technology driver may be used to operate the hard drive from which the system is booting or a hard drive that contains important data. For this reason, you cannot remove or un-install this driver from the system; however, you will have the ability to uninstall all other non-driver components.
The following non-driver components can be un-installed:
• Intel® Rapid Storage Technology software
• Help documentation
• Start Menu shortcuts
• System tray icon service
• RAID monitor service

6.2 Where to Obtain the Software
If a CD or DVD was included with your motherboard or system, it should include the Intel® Rapid Storage Technology software.
The latest version of Intel® Rapid Storage Technology can also be downloaded from Download Center at: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101

6.3 Installation Steps
Note: The instructions below assume that the BIOS has been configured correctly and the RAID driver has been installed using the F6 installation method (if applicable).
1. Run the Intel® Rapid Storage Technology installation file.
2. On the welcome screen, click Next to continue.
3. Review the Warning screen and click Next to continue.
4. Review the License Agreement and click Yes to accept the license agreement terms.
5. Review the Readme File Information and click Next to continue.
6. Click Finish to complete the installation and restart the system.

6.4 Confirming Software Installation
Refer to the image below to confirm that Intel® Rapid Storage Technology has been installed.
If installation was done using F6 or an unattended installation method, you can confirm that the Intel® Rapid Storage Technology was loaded by following these steps:
Note: The following instructions assume Classic mode in Windows* XP.
1. Click the Start button and then Control Panel.
2. Double-click the System icon.
3. Select the Hardware tab.
4. Click the Device Manager button.
5. Expand the SCSI and RAID Controllers entry.
6. Right-click the SATA RAID Controller entry.
7. Select the Driver tab.
8. Click the Driver Details button. The iastor.sys file should be listed.
Example: Refer to Figure 5.
[Figure: Driver details example]
NOTE: The controller shown here may differ from the controller displayed for your system.

6.5 Version Identification
1. Open the Intel® Rapid Storage Technology software.
2. Click the Help button and then the About button.
NOTE: The version information shown here may differ from the information displayed for your system.

7 RAID-Ready Setup

7.1 Overview
A RAID-Ready system is a system configuration that allows a user to perform a RAID migration at a later date.
For more information on RAID migrations, see the RAID Migration section of this User Guide (Section 8).

7.2 System Requirements
In order for a system to be considered RAID-Ready, it must meet all of the following requirements:
• Contains a supported Intel chipset
• Includes a single SATA hard drive
• RAID must be enabled in the BIOS
• Motherboard BIOS must include the Intel® Rapid Storage Technology option ROM
• Intel® Rapid Storage Technology must be loaded
• A partition that does not take up the entire capacity of the hard drive (4-5MB of free space is sufficient)

7.3 RAID-Ready System Setup Steps
To set up a RAID-Ready system, follow these steps:
1. Enable RAID in system BIOS using the steps listed in Enabling RAID in BIOS (Section 3.2).
2. Install the Intel® Rapid Storage Technology driver using the steps listed in F6 Installation Steps (Section 5.2.3).
3. Install Intel® Rapid Storage Technology using the steps listed in Installation Steps (Section 6.3).

8 Converting RAID-Ready to Full RAID

8.1 Overview
This section explains how to convert (or migrate) from a RAID-Ready system to a fully-functional RAID system. The example in this section describes the migration steps for RAID 1.

8.2 RAID-Ready to 2-drive RAID 1
To convert a RAID-Ready system into a system with a 2-drive RAID 1 volume, follow these steps:
Warning: This operation will delete all existing data from the additional hard drive or drives and the data cannot be recovered. It is critical to back up all important data on the additional drives before proceeding. The data on the source hard drive, however, will be preserved.
1. Install an additional SATA hard drive in the system.
2. Start Windows and open the Intel® Rapid Storage Technology software.
3. Select Create a custom volume.
4. On the Select Volume Type screen, select Real-time data protection (RAID 1) and then click Next.
5. On the Configure Volume screen:
a. Select the two installed disks
b. Choose to keep data on the "System" disk
c. Click Next
6.
Review the warning screen and then click Create Volume.
7. Review the confirmation screen and then click OK.
8. After the volume has been created, click OK on the completion screen.
9. Review the Status screen, now showing the RAID array just created.
10. The data migration will begin and may take some time. During the migration, you can see the current status by holding the mouse pointer over the Intel® Rapid Storage Technology status bar icon.

9 Verify and Repair

9.1 Overview
Verify and Repair checks a volume for inconsistent or bad data. It may also fix any data problems or parity errors. The Verify process happens:
• Automatically after a hard system shutdown or system crash (except when configured for RAID 0)
• Manually when started from within the Intel® Rapid Storage Technology software
The UI displays two functions:
• Verify Only
• Verify and Repair

9.2 Actions during Verify and Repair
The Verify process checks each stripe rather than copying data. The driver walks through every stripe in the volume, starting at the lowest logical block address (LBA).
Array Type    Actions
RAID 0        Verify: checks for any read failures. Repair: cannot repair because there is no copy of good data.
RAID 1        Verify: checks for data mismatches and read failures. Repair: copies to mirror.
RAID 5        Verify: checks for parity issues and read failures. Repair: updates parity; assumes the data is correct and regenerates and rewrites parity.
RAID 10       Verify: checks for data mismatches and read failures. Repair: copies to mirror.

Appendix A: Error Messages

A.1 Incompatible Hardware
Issue: The following error message appears during installation:
Incompatible hardware. This software is not supported on this chipset. Please select "Yes" to view the Readme file for a list of supported products. Refer to section 2 titled "System Requirements".
To resolve this issue, install the Intel® Rapid Storage Technology software on a system with a supported Intel chipset, or ensure that AHCI or RAID is enabled in the system BIOS.

A.2 Operating System Not Supported
Issue: The following error message appears during installation:
This operating system is not currently supported by this install package. Installer will now exit.
To resolve this issue, install the Intel® Rapid Storage Technology software on a supported operating system.

A.3 Source Hard Drive Cannot Be Larger
Issue: When attempting to migrate from a single hard drive (or a RAID-Ready configuration) to a RAID configuration, the following error message appears and the migration process will not begin:
The source hard drive cannot be larger than the selected hard drive member(s). Do one of the following to correct the problem:
- If already inserted, select larger hard drive member(s).
- Insert larger hard drive(s) into the system, and re-launch the Create RAID Volume from Existing Hard Drive Wizard.
Follow the steps listed in the error message to resolve the problem.

A.4 Hard Drive Has System Files
Issue: The following error message appears after selecting a hard drive as a member hard drive during the Create RAID Volume process:
This hard drive has system files and cannot be used to create a RAID volume. Please select another hard drive.
Solution: Select a new hard drive.

A.5 Source Hard Drive is Dynamic Disk
Issue: When attempting to migrate from a RAID-Ready system to a full-RAID system, an error message is received that says the migration cannot continue because the source drive is a dynamic disk. However, Microsoft* Windows* Disk Management shows the disk as basic, not dynamic.
This issue may occur if there is not enough space for the migration to successfully complete. Instead of reporting that there is not enough space, the Intel® Rapid Storage Technology software reports that the migration cannot continue because the source drive is a dynamic disk.
Note: This error is not related to the size of the destination hard drive(s). It may be received even if the destination hard drive(s) are equal to or greater in size than the source hard drive.
To resolve this issue:
• If there is a single partition on the source hard drive, reducing the size of the partition by a few MBs may resolve the issue and allow the migration to occur.
• If there are multiple partitions on the source hard drive, reducing the size of the second partition by a few MBs may resolve the issue and allow the migration to occur.

Document Number: XXXXXX
Intel® Matrix Storage Manager 8.x User's Manual
January 2009
Revision 1.0
ver7.0 / User's Manual

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
The Intel® Matrix Storage Manager may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Intel, Intel® Matrix Storage Manager, Intel® Matrix Storage Technology, Intel® Rapid Recover Technology, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2008, Intel Corporation. All rights reserved.

Contents
1 Introduction
1.1 Terminology
1.2 Reference Documents
2 Intel® Matrix Storage Manager Features
2.1 Feature Overview
2.2 RAID 0 (Striping)
2.3 RAID 1 (Mirroring)
2.4 RAID 5 (Striping with Parity)
2.5 RAID 10
2.6 Matrix RAID
2.7 RAID Migration
2.8 RAID Level Migration
2.9 Intel® Rapid Recover Technology
2.10 Advanced Host Controller Interface
2.10.1 Native Command Queuing
2.10.2 Hot Plug
3 RAID BIOS Configuration
3.1 Overview
3.2 Enabling RAID in BIOS
4 Intel® Matrix Storage Manager Option ROM
4.1 Overview
4.2 User Interface
4.3 Version Identification
4.4 RAID Volume Creation
5 Loading Driver During OS Installation
5.1 Overview
5.2 F6 Installation Method
5.2.1 Automatic F6 Floppy Creation
5.2.2 Manual F6 Floppy Creation
5.2.3 F6 Installation Steps
6 Intel® Matrix Storage Manager Installation
6.1 Overview
6.2 Where to Obtain Software
6.3 Installation Steps
6.4 How to Confirm Software Installation
6.5 Version Identification
6.5.1 Version Identification Using Intel® Matrix Storage Console
6.5.2 Version Identification Using Driver File
7 RAID-Ready Setup
7.1 Overview
7.2 System Requirements
7.3 RAID-Ready System Setup Steps
8 RAID Migration
8.1 Overview
8.2 RAID Migration Steps: RAID-Ready to 2-drive RAID 0/1
8.3 RAID Migration Steps: RAID-Ready to 3 or 4-drive RAID 0/5
9 Volume Creation
9.1 RAID Volume Creation
9.2 Recovery Volume Creation
9.2.1 Recovery Volume Creation in Basic Mode
9.2.2 Recovery Volume Creation in Advanced Mode
Appendix A Error Messages
A.1 Incompatible Hardware
A.2 Operating System Not Supported
A.3 Source Hard Drive Cannot Be Larger
A.4 Hard Drive Has System Files
A.5 Source Hard Drive is Dynamic Disk

Figures
Figure 1. Matrix RAID
Figure 2. User Prompt
Figure 3. Start Menu Item
Figure 4. Driver Details Example
Figure 5. Driver Version Information
Figure 6. Tray Icon Status
Figure 7. User Interface Status
Figure 8. Progress Dialog

Tables
Table 1. RAID 0 Overview
Table 2. RAID 1 Overview
Table 3. RAID 5 Overview
Table 4. RAID 10 Overview
Table 5. Recovery Volume Overview

Revision History
Document Number: N/A
Revision Number: 1.0
Description: Aligns with 8.x release
• Clarified RAID-Ready requirements
Revision Date: January 2009

1 Introduction
The purpose of this document is to enable a user to properly set up and configure a system using Intel® Matrix Storage Manager. It provides steps for setup and configuration, as well as a brief overview of Intel® Matrix Storage Manager features.
Note: The information in this document is only relevant on systems with a supported Intel chipset and a supported operating system.
Supported Intel chipset and operating system information is available at the Intel® Rapid Storage Technology support web page.
Note: The majority of the information in this document is related to either software configuration or hardware integration.
Intel is not responsible for the software written by third party vendors or the implementation of Intel components in the products of third party manufacturers. Customers should always contact the place of purchase or system/software manufacturer with support questions about their specific hardware or software configuration.

1.1 Terminology
AHCI: Advanced Host Controller Interface, an interface specification that allows the storage driver to enable advanced Serial ATA features such as Native Command Queuing, native hot plug, and power management.
Continuous Update Policy: When a recovery volume is using this policy, data on the master drive is copied to the recovery drive automatically as long as both drives are connected to the system.
Intel® Matrix Storage Manager Option ROM: A code module built into the system BIOS that provides boot support for RAID volumes as well as a user interface for configuring and managing RAID volumes.
Master Drive: The hard drive that is the designated source drive in a recovery volume.
Matrix RAID: Two independent RAID volumes within a single RAID array.
Member: A hard drive used within a RAID array.
Migration: The process of converting a system's data storage configuration from a non-RAID configuration (pass-thru) to a RAID configuration.
Hot Plug: The unannounced removal and insertion of a Serial ATA hard drive while the system is powered on.
NCQ: Native Command Queuing, a command protocol in Serial ATA that allows multiple commands to be outstanding within a hard drive at the same time. The commands are dynamically reordered to increase hard drive performance.
On Request Update Policy: When a recovery volume is using this policy, data on the master drive is copied to the recovery drive when you request it. Only changes since the last update process are copied.
OS: Operating System
Port0: A serial ATA port (connector) on a motherboard identified as Port0.
Port1 | A Serial ATA port (connector) on a motherboard identified as Port1.
Port2 | A Serial ATA port (connector) on a motherboard identified as Port2.
Port3 | A Serial ATA port (connector) on a motherboard identified as Port3.
POST | Power-On Self Test
RAID | Redundant Array of Independent Drives: allows data to be distributed across multiple hard drives to provide data redundancy or to enhance data storage performance.
RAID 0 (striping) | The data in the RAID volume is striped across the array's members. Striping divides data into units and distributes those units across the members, which improves read/write performance but does not create data redundancy.
RAID 1 (mirroring) | The data in the RAID volume is mirrored across the RAID array's members. Mirroring is the key feature of RAID 1: duplicate data is written to each member, which creates data redundancy and increases fault tolerance.
RAID 5 (striping with parity) | The data in the RAID volume and parity are striped across the array's members. Parity information is written with the data in a rotating sequence across the members of the array. This RAID level is a preferred configuration for efficiency, fault-tolerance, and performance.
RAID 10 (striping and mirroring) | The RAID level where information is striped across a two-disk array for system performance. Each of the drives in the array has a mirror for fault tolerance. RAID 10 provides the performance benefits of RAID 0 and the redundancy of RAID 1. However, it requires four hard drives.
RAID Array | A logical grouping of physical hard drives.
RAID Level Migration | The process of converting a system's data storage configuration from one RAID level to another.
RAID Volume | A fixed amount of space across a RAID array that appears as a single physical hard drive to the operating system. Each RAID volume is created with a specific RAID level to provide data redundancy or to enhance data storage performance.
Recovery Drive | The hard drive that is the designated target drive in a recovery volume.
Recovery Volume | A volume utilizing Intel(R) Rapid Recover Technology.

1.2 Reference Documents

Document | Document No./Location
Not Applicable |

2 Intel® Matrix Storage Manager Features

2.1 Feature Overview

The Intel® Matrix Storage Manager software package provides high-performance Serial ATA and Serial ATA RAID capabilities for supported operating systems. The key features of the Intel® Matrix Storage Manager are as follows:

• RAID 0
• RAID 1
• RAID 5
• RAID 10
• Matrix RAID
• RAID migration and RAID level migration
• Intel® Rapid Recover Technology
• Advanced Host Controller Interface (AHCI) support

2.2 RAID 0 (Striping)

RAID 0 uses the read/write capabilities of two or more hard drives working in unison to maximize the storage performance of a computer system. Table 1 provides an overview of the advantages, the level of fault-tolerance provided, and the typical usage of RAID 0.

Table 1. RAID 0 Overview
Hard Drives Required: 2-6
Advantage: Highest transfer rates
Fault-tolerance: None. If one disk fails, all data will be lost.
Application: Typically used in desktops and workstations for maximum performance for temporary data and high I/O rate. 2-drive RAID 0 available in specific mobile configurations.

2.3 RAID 1 (Mirroring)

A RAID 1 array contains two hard drives whose data is mirrored in real time to provide good data reliability in the case of a single disk failure. When one disk drive fails, all data is immediately available on the other without any impact to the integrity of the data. Table 2 provides an overview of the advantages, the level of fault-tolerance provided, and the typical usage of RAID 1.

Table 2.
RAID 1 Overview
Hard Drives Required: 2
Advantage: 100% redundancy of data. One disk may fail, but data will continue to be accessible. A rebuild to a new disk is recommended to maintain data redundancy.
Fault-tolerance: Excellent. Disk mirroring means that all data on one disk is duplicated on another disk.
Application: Typically used for smaller systems where capacity of one disk is sufficient and for any application(s) requiring very high availability. Available in specific mobile configurations.

2.4 RAID 5 (Striping with Parity)

A RAID 5 array contains three or more hard drives where the data and parity are striped across all the hard drives in the array. Parity is a mathematical method for recreating data that was lost from a single drive, which increases fault-tolerance. Table 3 provides an overview of the advantages, the level of fault-tolerance provided, and the typical usage of RAID 5.

Table 3. RAID 5 Overview
Hard Drives Required: 3-6
Advantage: Higher percentage of usable capacity and high read performance as well as fault-tolerance.
Fault-tolerance: Excellent. Parity information allows data to be rebuilt after replacing a failed hard drive with a new drive.
Application: Storage of large amounts of critical data. Not available in mobile configurations.

2.5 RAID 10

A RAID 10 array uses four hard drives to create a combination of RAID levels 0 and 1. It is a striped set whose members are each a mirrored set. Table 4 provides an overview of the advantages, the level of fault-tolerance provided, and the typical usage of RAID 10.

Table 4. RAID 10 Overview
Hard Drives Required: 4
Advantage: Combines the read performance of RAID 0 with the fault-tolerance of RAID 1.
Fault-tolerance: Excellent. Disk mirroring means that all data on one disk is duplicated on another disk.
Application: High-performance applications requiring data protection, such as video editing.
Not available in mobile configurations.

2.6 Matrix RAID

Matrix RAID allows you to create two RAID volumes on a single RAID array. As an example, on a system with an Intel® 82801GR I/O controller hub (ICH7R), Intel® Matrix Storage Manager allows you to create both a RAID 0 volume and a RAID 5 volume across four Serial ATA hard drives.

Example: Refer to Figure 1.

Figure 1. Matrix RAID

2.7 RAID Migration

The RAID migration feature enables a properly configured PC, known as a RAID-Ready system, to be converted into a high-performance RAID 0, RAID 1, RAID 5, or RAID 10 configuration by adding one or more Serial ATA hard drives to the system and invoking the RAID migration process from within Windows.

The following RAID migrations are supported:

Note: Not all migrations may be available, as each migration is supported only on specific platform configurations.

• RAID-Ready to 2, 3, 4, 5, or 6-drive RAID 0
• RAID-Ready to 2-drive RAID 1
• RAID-Ready to 3, 4, 5, or 6-drive RAID 5
• RAID-Ready to 4-drive RAID 10

The migrations do not require re-installation of the operating system. All applications and data remain intact.

2.8 RAID Level Migration

The RAID level migration feature enables a user to migrate data from a RAID 0, RAID 1, or RAID 10 volume to RAID 5 by adding any additional Serial ATA hard drives necessary and invoking the modify volume process from within Windows.

The following RAID level migrations are supported:

Note: Not all migrations may be available, as each migration is supported only on specific platform configurations.

• 2-drive RAID 0 to 3, 4, 5, or 6-drive RAID 5
• 3-drive RAID 0 to 4, 5, or 6-drive RAID 5
• 4-drive RAID 0 to 5 or 6-drive RAID 5
• 2-drive RAID 1 to 3, 4, 5, or 6-drive RAID 5
• 4-drive RAID 10 to 4, 5, or 6-drive RAID 5

RAID level migrations do not require re-installation of the operating system.
All applications and data remain intact.

2.9 Intel® Rapid Recover Technology

Intel® Rapid Recover Technology utilizes RAID 1 (mirroring) functionality to copy data from a designated master drive to a designated recovery drive. The master drive data can be copied to the recovery drive either continuously or on request.

When using the continuous update policy, changes made to the data on the master drive while the system is not docked are automatically copied to the recovery drive when the system is re-docked. When using the on request update policy, the master drive data can be restored to a previous state by copying the data on the recovery drive back to the master drive.

Table 5 provides an overview of the advantages, the disadvantages, and the typical usage of Intel® Rapid Recover Technology.

Table 5. Recovery Volume Overview
Hard Drives Required: 2
Advantage: More control over how data is copied between master and recovery drives; fast volume updates (only changes to the master drive since the last update are copied to the recovery drive); member hard drive data can be viewed in Microsoft Windows Explorer*.
Disadvantage: No increase in volume capacity.
Application: Critical data protection for mobile systems; fast restoration of the master drive to a previous or default state.

2.10 Advanced Host Controller Interface

Advanced Host Controller Interface (AHCI) is an interface specification that allows the storage driver to enable advanced Serial ATA features such as Native Command Queuing and Native Hot Plug.

2.10.1 Native Command Queuing

Native Command Queuing (NCQ) is a feature supported by AHCI that allows Serial ATA hard drives to accept more than one command at a time. NCQ, when used in conjunction with one or more hard drives that support NCQ, increases storage performance on random workloads by allowing the drive to internally optimize the order of commands.
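
To illustrate why letting the drive reorder commands can help on random workloads, here is a minimal sketch in Python. It does not describe how any particular drive's firmware works: the greedy nearest-address policy, the function names, and the sample block addresses are all assumptions chosen for illustration, and real drives also take rotational position and other factors into account.

```python
def reorder_by_nearest(head, pending):
    """Return the queued block addresses in greedy nearest-first order.

    head    -- current head position (hypothetical block address)
    pending -- block addresses of the commands waiting in the queue
    """
    remaining = list(pending)
    order = []
    while remaining:
        # Pick the queued command closest to the current head position.
        nearest = min(remaining, key=lambda lba: abs(lba - head))
        remaining.remove(nearest)
        order.append(nearest)
        head = nearest
    return order


def total_travel(head, order):
    """Sum the head movement needed to service commands in the given order."""
    travel = 0
    for lba in order:
        travel += abs(lba - head)
        head = lba
    return travel


# Five commands in arrival order (illustrative addresses only).
queue = [900, 100, 850, 150, 800]
reordered = reorder_by_nearest(0, queue)

print(total_travel(0, queue))      # 3800 when serviced in arrival order
print(total_travel(0, reordered))  # 900 when serviced in reordered form
```

Without queuing, the drive sees one command at a time and must service them in arrival order; with several commands outstanding, it has the information needed to choose a cheaper order.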
Note: To take advantage of NCQ, you need the following:

• Chipset that supports AHCI
• Intel® Matrix Storage Manager
• One or more Serial ATA (SATA) hard drives that support NCQ

2.10.2 Hot Plug

Hot plug, also referred to as hot swap, is a feature supported by AHCI that allows Serial ATA hard drives to be removed or inserted while the system is powered on and running. As an example, hot plug may be used to replace a failed hard drive that is in an externally-accessible drive enclosure.

Note: To take advantage of hot plug, you need the following:

• Chipset that supports AHCI
• Intel® Matrix Storage Manager
• Hot plug capability correctly enabled in the system BIOS by the OEM/motherboard manufacturer

3 RAID BIOS Configuration

3.1 Overview

To install the Intel® Matrix Storage Manager, the system BIOS must include the Intel® Matrix Storage Manager option ROM. The Intel® Matrix Storage Manager option ROM is tied to the controller hub. Version 7.0 of the option ROM supports platforms based on the Intel® 82801HEM I/O controller hub.

3.2 Enabling RAID in BIOS

Use the following steps to enable RAID in the system BIOS:

Note: The instructions listed below are specific to motherboards manufactured by Intel with a supported Intel chipset. The specific BIOS settings on non-Intel manufactured motherboards may differ. Refer to the motherboard documentation or contact the motherboard manufacturer or your place of purchase for specific instructions. Always follow the instructions that are provided with your motherboard.

1. Press the key after the Power-On-Self-Test (POST) memory test begins.
2. Select the Advanced menu, then the Drive Configuration menu.
3. Switch the Drive Mode option from Legacy to Enhanced.
4. Enable Intel(R) RAID Technology.
5. Press the key to save the BIOS settings and exit the BIOS Setup program.

4 Intel® Matrix Storage Manager Option ROM

4.1 Overview

The Intel® Matrix Storage Manager option ROM is a PnP option ROM that provides a pre-operating system user interface for RAID configurations. It also provides BIOS and DOS disk services (Int13h).

4.2 User Interface

To enter the Intel® Matrix Storage Manager option ROM user interface, press the and keys simultaneously when prompted during the Power-On Self Test (POST).

Example: Refer to Figure 2.

Figure 2. User Prompt

NOTE: The hard drive(s) and hard drive information listed for your system can differ from the following example.

4.3 Version Identification

To identify the specific version of the Intel® Matrix Storage Manager option ROM integrated into the system BIOS, enter the option ROM user interface. The version number is located in the top right corner with the following format: vX.Y.W.XXXX, where X and Y are the major and minor version numbers.

4.4 RAID Volume Creation

Use the following steps to create a RAID volume using the Intel® Matrix Storage Manager user interface:

Note: The following procedure should only be used with a newly-built system or if you are reinstalling your operating system. The following procedure should not be used to migrate an existing system to RAID 0. If you wish to create matrix RAID volumes after the operating system software is loaded, they should be created using the Intel® Matrix Storage Console in Windows.

1. Press the and keys simultaneously when the following window appears during POST:
2. Select option 1. Create RAID Volume and press the key.
3. Type in a volume name and press the key, or press the key to accept the default name.
4. Select the RAID level by using the < > or < > keys to scroll through the available values, then press the key.
5. Press the key to select the physical disks. A dialog similar to the following will appear:
6. Select the appropriate number of hard drives by using the < > or < > keys to scroll through the list of available hard drives. Press the key to select a drive. When you have finished selecting hard drives, press the key.
7. Unless you have selected RAID 1, select the strip size by using the < > or < > keys to scroll through the available values, then press the key.
8. Select the volume capacity and press the key.
Note: The default value indicates the maximum volume capacity using the selected disks. If less than the maximum volume capacity is chosen, creation of a second volume is needed to utilize the remaining space (i.e. a matrix RAID configuration).
9. At the Create Volume prompt, press the key to create the volume. The following prompt will appear:
10. Press the key to confirm volume creation.
11. To exit the option ROM user interface, select option 5. Exit and press the key.
12. Press the key again to confirm exit.

Note: To change any of the information before the volume creation has been confirmed, you must exit the Create Volume process and restart it. Press the key to exit the Create Volume process.

5 Loading Driver During OS Installation

5.1 Overview

Unless using Microsoft Windows Vista*, the Intel® Matrix Storage Manager driver must be loaded during operating system installation using the F6 installation method. This is required in order to install an operating system onto a hard drive or RAID volume when in RAID mode or onto a hard drive when in AHCI mode.
If using Microsoft Windows Vista, this is not required, as the operating system includes a driver for the AHCI and RAID controllers. Refer to Intel® Matrix Storage Manager Installation for instructions on how to install an updated version of the software after the operating system is installed.

5.2 F6 Installation Method

The F6 installation method requires a floppy with the driver files.

5.2.1 Automatic F6 Floppy Creation

Use the following steps to automatically create a floppy that contains the files needed during the F6 installation process:

1. Download the latest Floppy Configuration Utility from the Intel download site: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101
2. Run the .EXE file.

Note: Use F6flpy32.exe on a 32-bit system. Use F6flpy64.exe on a 64-bit system.

5.2.2 Manual F6 Floppy Creation

Use the following steps to manually create a floppy that contains the files needed during the F6 installation process:

1. Download the Intel® Matrix Storage Manager and save it to your local drive (or use the CD shipped with your motherboard which contains the Intel® Matrix Storage Manager).
Note: The Intel® Matrix Storage Manager can be downloaded from the following website: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101
2. Extract the driver files by running 'C:\IATAXX_CD.EXE -A -A -P C:\'.
Note: This is described in the "Advanced Installation Instructions" section of the README.TXT.
3. Copy the IAAHCI.CAT, IAAHCI.INF, IASTOR.CAT, IASTOR.INF, IASTOR.SYS, and TXTSETUP.OEM files to the root directory of a floppy diskette.
Note: If the system has a 32-bit processor, copy the files found in the Drivers folder; if the system has a 64-bit processor, copy the files found in the Drivers64 folder.

5.2.3 F6 Installation Steps

To install the Intel® Matrix Storage Manager driver using the F6 installation method, complete the following steps:

1. Press the F6 key at the beginning of Windows XP setup (during text-mode phase) when prompted in the status line with the "Press F6 if you need to install a third party SCSI or RAID driver" message.
Note: After pressing F6, nothing will happen immediately; setup will temporarily continue loading drivers and then you will be prompted with a screen to load support for mass storage device(s).
2. Press the key to "Specify Additional Device".
3. Insert the floppy disk containing the driver files when you see the following prompt: "Please insert the disk labeled Manufacturer-supplied hardware support disk into Drive A:" and press the key. Refer to Automatic F6 Floppy Creation for instructions on creating the floppy.
4. Select the RAID or AHCI controller entry that corresponds to your BIOS setup and press the key.
Note: Not all available selections may appear in the list; use the < > or < > to see additional options.
5. Press the key to confirm.

At this point, you have successfully installed the Intel® Matrix Storage Manager driver using the F6 method, and Windows XP setup should continue. Leave the floppy disk in the floppy drive until the system reboots itself, because Windows setup will need to copy the files again from the floppy to the Windows installation folders. After Windows setup has copied these files again, remove the floppy diskette so that Windows setup can reboot as needed.

6 Intel® Matrix Storage Manager Installation

6.1 Overview

After installing an operating system onto a RAID volume or on a Serial ATA hard drive when in RAID or AHCI mode, the Intel® Matrix Storage Manager can be loaded from within Windows. This installs the user interface (i.e. Intel® Matrix Storage Console), the tray icon service, and the monitor service onto the system, allowing you to monitor the health of your RAID volume and/or hard drives. This method can also be used to upgrade to a newer version of the Intel® Matrix Storage Manager.
Warning: The Intel® Matrix Storage Manager driver may be used to operate the hard drive from which the system is booting or a hard drive that contains important data. For this reason, you cannot remove or un-install this driver from the system; however, you can un-install all other non-driver components.

The following non-driver components can be un-installed:

• Intel® Matrix Storage Console
• Help documentation
• Start Menu shortcuts
• System tray icon service
• RAID monitor service

6.2 Where to Obtain Software

If a CD-ROM was included with your motherboard or system, it should include the Intel® Matrix Storage Manager. The Intel® Matrix Storage Manager can also be downloaded from the following Intel website: http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2101

6.3 Installation Steps

Note: The instructions below assume that the BIOS has been configured correctly and the RAID driver has been installed using the F6 installation method (if applicable).

1. Run the Intel® Matrix Storage Manager executable.
2. Click Next to continue.
3. Carefully review the warning and click Next to continue.
4. Click Yes to accept the license agreement terms.
5. Review the readme if needed and click Next to continue.
6. Click Finish to complete installation and reboot the system.

6.4 How to Confirm Software Installation

Refer to Figure 3 to confirm that Intel® Matrix Storage Manager has been installed.

Figure 3.
Start Menu Item

If installation was done by have-disk, F6, or an unattended installation method, you can confirm that the Intel® Matrix Storage Manager has been loaded by completing the following steps:

Note: The following instructions assume Classic mode in Windows* XP Professional.

1. Click on the Start button and then the Control Panel entry.
2. Double-click the System icon.
Note: If using Microsoft Windows Vista, first select Classic View.
3. Select the Hardware tab.
4. Click on the Device Manager button.
5. Expand the SCSI and RAID Controllers entry.
6. Right-click on the Intel(R) 82801XX SATA Controller entry.
7. Select the Driver tab.
8. Click on the Driver Details button. The iastor.sys file should be listed.

Example: Refer to Figure 4.

Figure 4. Driver Details Example

NOTE: The controller shown here may differ from the controller displayed for your system.

6.5 Version Identification

There are two ways to determine which version of the Intel® Matrix Storage Manager is installed:

1. Use the Intel® Matrix Storage Console
2. Locate the RAID driver (iaStor.sys) file and view the file properties

6.5.1 Version Identification Using Intel® Matrix Storage Console

1. To access the Intel® Matrix Storage Console, refer to Figure 3.
2. Under the View menu, select System Report.
3. Select the Intel® RAID Technology tab for the driver version information.

Example: Refer to Figure 5.

Figure 5. Driver Version Information

NOTE: Driver version information shown here may differ from the information displayed for your system.

6.5.2 Version Identification Using Driver File

1. Locate the file iastor.sys in the following path: \Windows\System32\Drivers
2. Right-click on iastor.sys and select Properties.
3. Select the Version tab.
The version number should be listed after the File Version parameter in the following format: x.y.z.aaaa

7 RAID-Ready Setup

7.1 Overview

A "RAID Ready" system is a specific system configuration that allows a user to perform a RAID migration at a later date. For more information on RAID migrations, see RAID Migration.

7.2 System Requirements

In order for a system to be considered "RAID Ready", it must meet all of the following requirements:

• Contains a supported Intel chipset
• Includes a single Serial ATA (SATA) hard drive
• RAID controller must be enabled in the BIOS
• Motherboard BIOS must include the Intel® Matrix Storage Manager option ROM
• Intel® Matrix Storage Manager must be loaded
• A partition that does not take up the entire capacity of the hard drive (4-5MB of free space is sufficient)

7.3 RAID-Ready System Setup Steps

Note: The system must meet all the requirements listed in System Requirements.

1. Enable RAID in system BIOS using the steps listed in Enabling RAID in BIOS.
2. Install Intel® Matrix Storage Manager driver using the steps listed in F6 Installation Steps.
3. Install Intel® Matrix Storage Manager using the steps listed in Installation Steps.

8 RAID Migration

8.1 Overview

The following sections explain how to migrate from a RAID-Ready system to a RAID system.

8.2 RAID Migration Steps: RAID-Ready to 2-drive RAID 0/1

Use the following steps to convert a RAID-Ready system into a system with a 2-drive RAID 0 or 1 volume:

Note: The steps listed in this section assume that the system is a properly configured RAID-Ready system. For more information on how to configure a RAID-Ready system, see RAID-Ready System Setup Steps.

Warning: This operation will delete all existing data from the additional hard drive or drives and the data cannot be recovered. It is critical to back up all important data on the additional drives before proceeding.
The data on the source hard drive, however, will be preserved.

1. Physically add an additional SATA hard drive to the system.
2. Boot into Windows* and open the Intel® Matrix Storage Console. Example: Refer to Figure 3.
3. Select Protect data from a hard drive failure with RAID 1 or Improve storage performance with RAID 0.
4. Select Yes to confirm volume creation. In the following example, RAID 1 was selected. Refer to Figure 6, Figure 7, and Figure 8 for examples of volume creation progress indicators.
5. When the migration is complete, reboot the system if needed.
6. If applicable, use a third party application or the Microsoft* Windows* operating system tools to create and format a new data partition in any unused space or use a third party application to extend the partition to utilize any remaining space.

Figure 6. Tray Icon Status

Figure 7. User Interface Status

Figure 8. Progress Dialog

8.3 RAID Migration Steps: RAID-Ready to 3 or 4-drive RAID 0/5

Use the following steps to convert a RAID-Ready system into a system with a 3 or 4-drive RAID 0/5 volume:

Note: The steps listed in this section assume that the system is a properly configured RAID-Ready system. For more information on how to configure a RAID-Ready system, see RAID-Ready System Setup Steps.

Warning: This operation will delete all existing data from the additional hard drive or drives and the data cannot be recovered. It is critical to back up all important data on the additional drives before proceeding. The data on the source hard drive, however, will be preserved.

Warning: It is very important to note which disk is the source drive (the one containing all of the information to be migrated). On a RAID-Ready system, this can be determined by noting during POST which port the single hard drive is attached to.
You can also use the Intel® Matrix Storage Manager before the additional disks are installed to verify the port and serial number of the drive that contains the data.

1. Physically add two or three additional SATA hard drives to the system.
2. Boot into Windows* and open the Intel® Matrix Storage Console. Example: Refer to Figure 3.
3. Select Advanced Mode from the View menu.
4. Select Create RAID Volume from Existing Hard Drive from the Actions menu.
5. Click Next to continue.
6. Type in a volume name and press the key, or press the key to accept the default name.
7. Select a RAID level.
8. Select a strip size.
9. Click Next to continue.
10. Select a source hard drive.
Note: The source hard drive can be selected by double-clicking on the hard drive, or by single-clicking on the hard drive and then selecting the right arrow key. The data on this hard drive will be preserved and migrated to the new RAID volume.
11. Click Next to continue.
12. Select the member hard drives.
Note: The member hard drives can be selected by double-clicking on the hard drive, or by single-clicking on the hard drive and then selecting the right arrow key.
Warning: The data on the member hard drives will be deleted. Back up all important data before continuing.
13. Click Next to continue.
14. Use the field or the slider bar to specify the amount of available array space that will be used by the volume.
Note: Any remaining space can be used to create a second volume.
15. Click Finish to begin the migration process.
16. Once the migration is complete, reboot if needed.
17. If applicable, use a third party application or the Microsoft* Windows* operating system tools to create and format a new data partition in any unused space or use a third party application to extend the partition to utilize any remaining space.

9 Volume Creation

RAID and recovery volumes can be created using the Intel® Matrix Storage Console.

Note: RAID volume creation is only available as an option if you have two or more SATA hard drives in addition to another bootable device. If you wish to create a RAID volume using your boot device, you will need to perform a RAID migration. See RAID Migration for instructions on how to perform a migration.

9.1 RAID Volume Creation

Warning: Creating a RAID volume will permanently delete any existing data on the selected hard drives. Back up all important data before beginning these steps. If you wish to preserve the data, see RAID Migration for instructions on how to perform a RAID migration.

To create a RAID volume, use the following steps:

1. Open the Intel® Matrix Storage Console. (Start >> All Programs >> Intel® Matrix Storage Manager >> Intel® Matrix Storage Console)
2. Switch to advanced mode by selecting the Advanced Mode option under the View menu.
3. Select Create RAID Volume under the Actions menu.
4. Select Next.
5. Enter a name for the RAID volume.
6. Select a RAID level.
7. Select a strip size.
8. Select Next to continue.
9. Select the hard drives that will be used to create the RAID volume.
10. When you are finished selecting hard drives, select Next to continue.
11. Enter a size for the RAID volume.
12. Select Next to continue.
13. Select Finish to create the RAID volume.
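
When choosing a RAID level and entering a volume size in the steps above, it can help to estimate the largest volume a given set of drives could provide. The following Python sketch is one rough way to do that, not a description of how the Intel® Matrix Storage Console computes its default. The drive-count limits are taken from Tables 1 through 4; the capacity formulas are the standard ones for each RAID level and ignore any metadata overhead; the function name and the assumption that every member contributes only as much space as the smallest drive are illustrative.

```python
# Allowed drive counts per RAID level, as listed in Tables 1 through 4.
DRIVE_LIMITS = {
    "RAID 0": (2, 6),
    "RAID 1": (2, 2),
    "RAID 5": (3, 6),
    "RAID 10": (4, 4),
}


def usable_capacity_gb(level, drive_sizes_gb):
    """Estimate the maximum volume size for a RAID level (hypothetical helper).

    Assumes each member contributes only as much space as the smallest
    drive and that no space is reserved for metadata.
    """
    low, high = DRIVE_LIMITS[level]
    count = len(drive_sizes_gb)
    if not low <= count <= high:
        raise ValueError(f"{level} needs {low} to {high} drives, got {count}")

    smallest = min(drive_sizes_gb)
    if level == "RAID 0":
        return count * smallest          # striping: all members hold data
    if level == "RAID 1":
        return smallest                  # mirroring: one copy is usable
    if level == "RAID 5":
        return (count - 1) * smallest    # one member's worth holds parity
    return (count // 2) * smallest       # RAID 10: half the members are mirrors


print(usable_capacity_gb("RAID 5", [500, 500, 500, 500]))   # 1500
print(usable_capacity_gb("RAID 10", [500, 500, 500, 500]))  # 1000
print(usable_capacity_gb("RAID 0", [500, 320]))             # 640
```

If a smaller size than this estimate is entered, the remaining array space stays available for a second volume, which is the matrix RAID case.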
9.2 Recovery Volume Creation

A recovery volume can be created using either Basic mode or Advanced mode in the Intel® Matrix Storage Console.

9.2.1 Recovery Volume Creation in Basic Mode

Warning: Creating a recovery volume will permanently delete any existing data on the drive selected as the recovery drive. Back up all important data before beginning these steps.

Note: This option may or may not be available depending on your system configuration. If you do not see the option listed, refer to Recovery Volume Creation in Advanced Mode.

To create a recovery volume in Basic mode, use the following steps:

1. Open the Intel® Matrix Storage Console. (Start >> All Programs >> Intel® Matrix Storage Manager >> Intel® Matrix Storage Console)
2. Select Protect data using Intel® Rapid Recover Technology.
3. Select Yes to confirm volume creation.

9.2.2 Recovery Volume Creation in Advanced Mode

Warning: Creating a recovery volume will permanently delete any existing data on the drive selected as the recovery drive. Back up all important data before beginning these steps.

To create a recovery volume in Advanced mode, use the following steps:

1. Open the Intel® Matrix Storage Console. (Start >> All Programs >> Intel® Matrix Storage Manager >> Intel® Matrix Storage Console)
2. Select Advanced Mode in the View menu.
3. Select Create Recovery Volume in the Actions menu.
4. Select Next to continue.
5. Modify the recovery volume name if you wish.
6. Select a hard drive to be used as the master hard drive for the recovery volume.
7. Select a hard drive to be used as the recovery hard drive for the recovery volume.
8. Select an update policy.
9. Select Finish to begin recovery volume creation.

Appendix A Error Messages

A.1 Incompatible Hardware

Issue: The following error message appears during installation:

Solution: This issue can be resolved by installing the Intel® Matrix Storage Manager on a system with a supported Intel chipset, or by ensuring that AHCI or RAID is enabled in the system BIOS.

A.2 Operating System Not Supported

Issue: The following error message appears during installation:

Solution: This issue can be resolved by installing the Intel® Matrix Storage Manager on a supported operating system.

A.3 Source Hard Drive Cannot Be Larger

Issue: When attempting to migrate from a single hard drive (or a RAID-Ready configuration) to a RAID configuration, the following error message appears and the migration process will not begin:

Solution: Follow the steps listed in the error message to resolve the problem.

A.4 Hard Drive Has System Files

Issue: The following error message appears after selecting a hard drive as a member hard drive during the Create RAID Volume process:

Solution: Select a new hard drive.

A.5 Source Hard Drive is Dynamic Disk

Issue: When attempting to migrate from a RAID-Ready configuration to a RAID configuration, an error message is received that says the migration cannot continue because the source drive is a dynamic disk. However, Microsoft* Windows* Disk Management shows the disk as basic, not dynamic.

Solution: Reduce the size of the partition by a few MB and check whether that resolves the issue.


Intel documentation. Search for an Intel product:

http://software.intel.com/sites/products/search/search.php?q=&x=26&y=18&product=&version=&docos=


Intel ® Math Kernel Library for Linux* OS User's Guide http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/mkl_userguide_lnx.pdf


Intel ® Math Kernel Library for Mac OS* X User's Guide http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_mac/mkl_userguide_mac.pdf


Intel ® Math Kernel Library for Windows* OS User's Guide Intel® MKL - Windows* OS Document Number: 315930-018US http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_win/mkl_userguide_win.pdf


Intel ® Math Kernel Library Reference Manual Document Number: 630813-045US MKL 10.3 Update 8 http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/mklman.pdf


Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Linux* OS Document Number: 324207-005US
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/start/getting_started_amplifier_xe_linux.pdf

Intel(R) VTune(TM) Amplifier XE 2011 Getting Started Tutorials for Windows* OS
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/win/start/getting_started_amplifier_xe_windows.pdf

Intel® VTune™ Amplifier XE 2011 Release Notes for Linux Installation Guide and Release Notes Document number: 323591-001U
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/start/release_notes_amplifier_xe_linux.pdf

Intel® VTune™ Amplifier XE 2011 Release Notes for Windows* OS Installation Guide and Release Notes Document number: 323401-001U
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/win/start/release_notes_amplifier_xe_windows.pdf

Intel(R) Threading Building Blocks Reference Manual
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/tbbxe/Reference.pdf

Intel® Threading Building Blocks Design Patterns Document Number 323512-005U
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/tbbxe/Design_Patterns.pdf

Intel® Parallel Studio 2011 SP1 Installation Guide and Release Notes Document number: 321604-003US 24 July 201
http://software.intel.com/sites/products/documentation/studio/studio/en-us/2011Update/release_notes_studio.pdf

Intel® Math Kernel Library Summary Statistics Application Note
http://software.intel.com/sites/products/documentation/hpc/mkl/sslnotes/sslnotes.pdf

Intel® Math Kernel Library Vector Statistical Library Notes
http://software.intel.com/sites/products/documentation/hpc/mkl/vslnotes/vslnotes.pdf

Intel® C++ Composer XE 2011 Getting Started Tutorials Document Number: 323648-001US
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/lin/getting_started_composerxe2011_cpp_lin.pdf

Intel® C++ Composer XE 2011 Getting Started Tutorials Document Number: 323649-001US
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/mac/getting_started_composerxe2011_cpp_mac.pdf

Intel® C++ Composer XE 2011 Getting Started Tutorials Document Number: 323647-001US
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/win/getting_started_composerxe2011_cpp_win.pdf

Intel® Fortran Composer XE 2011 Getting Started Tutorials Document Number: 323651-001US
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/lin/getting_started_composerxe2011_for_lin.pdf

Intel® Parallel Inspector 2011 Release Notes Installation Guide and Release Notes Document number: 320754-002U
http://software.intel.com/sites/products/documentation/studio/inspector/en-us/2011Update/start/release_notes_inspector.pdf

Intel® Visual Fortran Composer XE 2011 Getting Started Tutorials Document Number: 323650-001US
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/start/win/getting_started_composerxe2011_for_win.pdf

Intel® Rapid Storage Technology User Guide August 2011 Revision 1.
http://download.intel.com/support/chipsets/imsm/sb/irst_user_guide.pdf

Intel® Matrix Storage Manager 8.x User's Manual January 2009 Revision 1.

http://download.intel.com/support/chipsets/imsm/sb/8_x_raid_ahci_users_manual.pdf

Intel® Math Kernel Library for Linux* OS User's Guide
Intel® MKL - Linux* OS
Document Number: 314774-019US

Contents

Legal Information
Introducing the Intel® Math Kernel Library
Getting Help and Support
Notational Conventions

Chapter 1: Overview
  Document Overview
  What's New
  Related Information

Chapter 2: Getting Started
  Checking Your Installation
  Setting Environment Variables
    Scripts to Set Environment Variables
    Automating the Process of Setting Environment Variables
  Compiler Support
  Using Code Examples
  What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Chapter 3: Structure of the Intel® Math Kernel Library
  Architecture Support
  High-level Directory Structure
  Layered Model Concept
  Accessing the Intel® Math Kernel Library Documentation
    Contents of the Documentation Directories
    Viewing Man Pages

Chapter 4: Linking Your Application with the Intel® Math Kernel Library
  Linking Quick Start
    Using the -mkl Compiler Option
    Using the Single Dynamic Library
    Selecting Libraries to Link with
    Using the Link-line Advisor
    Using the Command-line Link Tool
  Linking Examples
    Linking on IA-32 Architecture Systems
    Linking on Intel(R) 64 Architecture Systems
  Linking in Detail
    Listing Libraries on a Link Line
    Dynamically Selecting the Interface and Threading Layer
    Linking with Interface Libraries
      Using the ILP64 Interface vs. LP64 Interface
      Linking with Fortran 95 Interface Libraries
    Linking with Threading Libraries
      Sequential Mode of the Library
      Selecting the Threading Layer
    Linking with Computational Libraries
    Linking with Compiler Run-time Libraries
    Linking with System Libraries
  Building Custom Shared Objects
    Using the Custom Shared Object Builder
    Composing a List of Functions
    Specifying Function Names
    Distributing Your Custom Shared Object

Chapter 5: Managing Performance and Memory
  Using Parallelism of the Intel® Math Kernel Library
    Threaded Functions and Problems
    Avoiding Conflicts in the Execution Environment
    Techniques to Set the Number of Threads
    Setting the Number of Threads Using an OpenMP* Environment Variable
    Changing the Number of Threads at Run Time
    Using Additional Threading Control
      Intel MKL-specific Environment Variables for Threading Control
      MKL_DYNAMIC
      MKL_DOMAIN_NUM_THREADS
      Setting the Environment Variables for Threading Control
  Tips and Techniques to Improve Performance
    Coding Techniques
    Hardware Configuration Tips
    Managing Multi-core Performance
    Operating on Denormals
    FFT Optimized Radices
  Using Memory Management
    Intel MKL Memory Management Software
    Redefining Memory Functions

Chapter 6: Language-specific Usage Options
  Using Language-Specific Interfaces with Intel® Math Kernel Library
    Interface Libraries and Modules
    Fortran 95 Interfaces to LAPACK and BLAS
    Compiler-dependent Functions and Fortran 90 Modules
  Mixed-language Programming with the Intel Math Kernel Library
    Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments
    Using Complex Types in C/C++
    Calling BLAS Functions that Return the Complex Values in C/C++ Code
    Support for Boost uBLAS Matrix-matrix Multiplication
    Invoking Intel MKL Functions from Java* Applications
      Intel MKL Java* Examples
      Running the Java* Examples
      Known Limitations of the Java* Examples

Chapter 7: Coding Tips
  Aligning Data for Consistent Results
  Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Chapter 8: Working with the Intel® Math Kernel Library Cluster Software
  Linking with ScaLAPACK and Cluster FFTs
  Setting the Number of Threads
  Using Shared Libraries
  Building ScaLAPACK Tests
  Examples for Linking with ScaLAPACK and Cluster FFT
    Examples for Linking a C Application
    Examples for Linking a Fortran Application

Chapter 9: Programming with Intel® Math Kernel Library in the Eclipse* Integrated Development Environment (IDE)
  Configuring the Eclipse* IDE CDT to Link with Intel MKL
  Getting Assistance for Programming in the Eclipse* IDE
    Viewing the Intel® Math Kernel Library Reference Manual in the Eclipse* IDE
    Searching the Intel Web Site from the Eclipse* IDE

Chapter 10: LINPACK and MP LINPACK Benchmarks
  Intel® Optimized LINPACK Benchmark for Linux* OS
    Contents of the Intel® Optimized LINPACK Benchmark
    Running the Software
    Known Limitations of the Intel® Optimized LINPACK Benchmark
  Intel® Optimized MP LINPACK Benchmark for Clusters
    Overview of the Intel® Optimized MP LINPACK Benchmark for Clusters
    Contents of the Intel® Optimized MP LINPACK Benchmark for Clusters
    Building the MP LINPACK
    New Features of Intel® Optimized MP LINPACK Benchmark
    Benchmarking a Cluster
    Options to Reduce Search Time

Appendix A: Intel® Math Kernel Library Language Interfaces Support
  Language Interfaces Support, by Function Domain
  Include Files

Appendix B: Support for Third-Party Interfaces
  GMP* Functions
  FFTW Interface Support

Appendix C: Directory Structure in Detail
  Detailed Structure of the IA-32 Architecture Directories
    Static Libraries in the lib/ia32 Directory
    Dynamic Libraries in the lib/ia32 Directory
  Detailed Structure of the Intel® 64 Architecture Directories
    Static Libraries in the lib/intel64 Directory
    Dynamic Libraries in the lib/intel64 Directory

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright © 2006 - 2011, Intel Corporation. All rights reserved.
Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Introducing the Intel® Math Kernel Library

The Intel® Math Kernel Library (Intel® MKL) improves performance of scientific, engineering, and financial software that solves large computational problems. Among other functionality, Intel MKL provides linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores (see the Intel® MKL Release Notes for the full list of supported processors). Intel MKL also performs well on non-Intel processors.

Intel MKL is thread-safe and extensively threaded using the OpenMP* technology.

Intel MKL provides the following major functionality:

• Linear algebra, implemented in LAPACK (solvers and eigensolvers) plus level 1, 2, and 3 BLAS, offering the vector, vector-matrix, and matrix-matrix operations needed for complex mathematical software. If you prefer the FORTRAN 90/95 programming language, you can call LAPACK driver and computational subroutines through specially designed interfaces with reduced numbers of arguments. A C interface to LAPACK is also available.
• ScaLAPACK (SCAlable LAPACK) with its support functionality including the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS). ScaLAPACK is available for Intel MKL for Linux* and Windows* operating systems.
• Direct sparse solver, an iterative sparse solver, and a supporting set of sparse BLAS (level 1, 2, and 3) for solving sparse systems of equations.
• Multidimensional discrete Fourier transforms (1D, 2D, 3D) with a mixed radix support (for sizes not limited to powers of 2). Distributed versions of these functions are provided for use on clusters on the Linux* and Windows* operating systems.
• A set of vectorized transcendental functions called the Vector Math Library (VML). For most of the supported processors, the Intel MKL VML functions offer greater performance than the libm (scalar) functions, while keeping the same high accuracy.
• The Vector Statistical Library (VSL), which offers high performance vectorized random number generators for several probability distributions, convolution and correlation routines, and summary statistics functions.
• Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.

For details see the Intel® MKL Reference Manual.

Getting Help and Support

Intel provides a support web site that contains a rich repository of self-help information, including getting started tips, known product issues, product errata, license information, user forums, and more. Visit the Intel MKL support website at http://www.intel.com/software/products/support/.

The Intel MKL documentation integrates into the Eclipse* integrated development environment (IDE). See Getting Assistance for Programming in the Eclipse* IDE.

Notational Conventions

The following term is used in reference to the operating system.

Linux* OS
    This term refers to information that is valid on all supported Linux* operating systems.

The following notations are used to refer to Intel MKL directories.

<parent product directory>
    The installation directory for the Intel® C++ Composer XE or Intel® Fortran Composer XE.
<mkl directory>
    The main directory where Intel MKL is installed: <mkl directory>=<parent product directory>/mkl. Replace this placeholder with the specific pathname in the configuring, linking, and building instructions.

The following font conventions are used in this document.

Italic
    Italic is used for emphasis and also indicates document names in body text, for example: see Intel MKL Reference Manual.
Monospace lowercase
    Indicates filenames, directory names, and pathnames, for example: ./benchmarks/linpack
Monospace lowercase mixed with uppercase
    Indicates:
    • Commands and command-line options, for example, icc myprog.c -L$MKLPATH -I$MKLINCLUDE -lmkl -liomp5 -lpthread
    • C/C++ code fragments, for example, a = new double [SIZE*SIZE];
UPPERCASE MONOSPACE
    Indicates system variables, for example, $MKLPATH.
Monospace italic
    Indicates a parameter in discussions, for example, lda. When enclosed in angle brackets, indicates a placeholder for an identifier, an expression, a string, a symbol, or a value, for example, <mkl directory>. Substitute one of these items for the placeholder.
[ items ]
    Square brackets indicate that the items enclosed in brackets are optional.
{ item | item }
    Braces indicate that only one of the items listed between braces should be selected. A vertical bar ( | ) separates the items.

Chapter 1: Overview

Document Overview

The Intel® Math Kernel Library (Intel® MKL) User's Guide provides usage information for the library. The usage information covers the organization, configuration, performance, and accuracy of Intel MKL, specifics of routine calls in mixed-language programming, linking, and more.

This guide describes OS-specific usage of Intel MKL, along with OS-independent features. The document contains usage information for all Intel MKL function domains.

This User's Guide provides the following information:

• Describes post-installation steps to help you start using the library
• Shows you how to configure the library with your development environment
• Acquaints you with the library structure
• Explains how to link your application with the library and provides simple usage scenarios
• Describes how to code, compile, and run your application with Intel MKL

This guide is intended for Linux OS programmers with beginner to advanced experience in software development.

See Also
Language Interfaces Support, by Function Domain

What's New

This User's Guide documents the Intel® Math Kernel Library (Intel® MKL) 10.3 Update 8. The document was updated to reflect the addition of Data Fitting Functions to the product.
Related Information

To learn how to use the library in your application, use this guide in conjunction with the following documents:

• The Intel® Math Kernel Library Reference Manual, which provides reference information on routine functionalities, parameter descriptions, interfaces, calling syntaxes, and return values.
• The Intel® Math Kernel Library for Linux* OS Release Notes.

Chapter 2: Getting Started

Checking Your Installation

After installing the Intel® Math Kernel Library (Intel® MKL), verify that the library is properly installed and configured:

1. Intel MKL installs in <parent product directory>. Check that the subdirectory of <parent product directory> referred to as <mkl directory> was created.
2. If you want to keep multiple versions of Intel MKL installed on your system, update your build scripts to point to the correct Intel MKL version.
3. Check that the following files appear in the <mkl directory>/bin directory and its subdirectories:
   mklvars.sh
   mklvars.csh
   ia32/mklvars_ia32.sh
   ia32/mklvars_ia32.csh
   intel64/mklvars_intel64.sh
   intel64/mklvars_intel64.csh
   Use these files to assign Intel MKL-specific values to several environment variables, as explained in Setting Environment Variables.
4. To understand how the Intel MKL directories are structured, see Intel® Math Kernel Library Structure.
5. To make sure that Intel MKL runs on your system, launch an Intel MKL example, as explained in Using Code Examples.

See Also
Notational Conventions

Setting Environment Variables

See Also
Setting the Number of Threads Using an OpenMP* Environment Variable

Scripts to Set Environment Variables

When the installation of Intel MKL for Linux* OS is complete, set the INCLUDE, MKLROOT, LD_LIBRARY_PATH, MANPATH, LIBRARY_PATH, CPATH, FPATH, and NLSPATH environment variables in the command shell using one of the script files in the bin subdirectory of the Intel MKL installation directory.

Choose the script corresponding to your system architecture and command shell as explained in the following table:

Architecture          Shell                   Script File
IA-32                 C                       ia32/mklvars_ia32.csh
IA-32                 Bash and Bourne (sh)    ia32/mklvars_ia32.sh
Intel® 64             C                       intel64/mklvars_intel64.csh
Intel® 64             Bash and Bourne (sh)    intel64/mklvars_intel64.sh
IA-32 and Intel® 64   C                       mklvars.csh
IA-32 and Intel® 64   Bash and Bourne (sh)    mklvars.sh

Running the Scripts

The scripts accept parameters to specify the following:

• Architecture.
• Addition of a path to Fortran 95 modules precompiled with the Intel® Fortran compiler to the FPATH environment variable. Supply this parameter only if you are using the Intel® Fortran compiler.
• Interface of the Fortran 95 modules. This parameter is needed only if you requested addition of a path to the modules.

Usage and values of these parameters depend on the script name (regardless of the extension). The following table lists values of the script parameters.

Script            Architecture                  Addition of a Path to           Interface
                  (required, when applicable)   Fortran 95 Modules (optional)   (optional)
mklvars_ia32      n/a †                         mod                             n/a
mklvars_intel64   n/a                           mod                             lp64 (default), ilp64
mklvars           ia32, intel64                 mod                             lp64 (default), ilp64

† Not applicable.
For example:

• The command
      mklvars_ia32.sh
  sets environment variables for the IA-32 architecture and adds no path to the Fortran 95 modules.
• The command
      mklvars_intel64.sh mod ilp64
  sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the ILP64 interface to the FPATH environment variable.
• The command
      mklvars.sh intel64 mod
  sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the LP64 interface to the FPATH environment variable.

NOTE Supply the parameter specifying the architecture first, if it is needed. Values of the other two parameters can be listed in any order.

See Also
High-level Directory Structure
Interface Libraries and Modules
Fortran 95 Interfaces to LAPACK and BLAS
Setting the Number of Threads Using an OpenMP* Environment Variable

Automating the Process of Setting Environment Variables

To automate setting of the INCLUDE, MKLROOT, LD_LIBRARY_PATH, MANPATH, LIBRARY_PATH, CPATH, FPATH, and NLSPATH environment variables, add mklvars*.*sh to your shell profile so that each time you log in, the script automatically executes and sets the paths to the appropriate Intel MKL directories. To do this, with a local user account, edit the following files by adding the appropriate script to the path manipulation section right before exporting variables:

Shell: bash
Files: ~/.bash_profile, ~/.bash_login, or ~/.profile
Commands:
    # setting up MKL environment for bash
    . /bin [/]/mklvars[].sh [] [mod] [lp64|ilp64]

Shell: sh
Files: ~/.profile
Commands:
    # setting up MKL environment for sh
    . /bin [/]/mklvars[].sh [] [mod] [lp64|ilp64]

Shell: csh
Files: ~/.login
Commands:
    # setting up MKL environment for csh
    source /bin [/]/mklvars[].csh [] [mod] [lp64|ilp64]

In the above commands, replace with ia32 or intel64.

If you have superuser permissions, add the same commands to a system-wide file: /etc/profile (for bash and sh) or /etc/csh.login (for csh).
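As an illustration of such a profile entry, the following sketch sources the environment script only when it is readable, so that a missing script does not abort the profile. The helper name load_mkl_env and the path /opt/intel/mkl are assumptions made for this example; substitute the actual Intel MKL installation directory on your system. The parameters intel64 mod lp64 follow the parameter table for the mklvars script.

```shell
# Hypothetical helper: source the Intel MKL environment script if it is readable.
# $1 is the Intel MKL installation directory (an assumption; set it for your system).
load_mkl_env() {
    mkl_vars="$1/bin/mklvars.sh"
    if [ -r "$mkl_vars" ]; then
        # Architecture first, then the optional module and interface parameters.
        . "$mkl_vars" intel64 mod lp64
    else
        echo "Intel MKL script not found: $mkl_vars" >&2
        return 1
    fi
}

# Example profile entry; /opt/intel/mkl is a placeholder path, not a documented default.
load_mkl_env /opt/intel/mkl || true
```

The trailing "|| true" keeps the rest of the profile running when the script is absent. The sketch targets bash and sh; csh needs its own syntax.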
CAUTION Before uninstalling Intel MKL, remove the above commands from all profile files where the script execution was added. Otherwise you may experience problems logging in.

See Also
Scripts to Set Environment Variables

Compiler Support

Intel MKL supports compilers identified in the Release Notes. However, the library has been successfully used with other compilers as well.

Intel MKL provides a set of include files to simplify program development by specifying enumerated values and prototypes for the respective functions. Calling Intel MKL functions from your application without an appropriate include file may lead to incorrect behavior of the functions.

See Also
Include Files

Using Code Examples

The Intel MKL package includes code examples, located in the examples subdirectory of the installation directory. Use the examples to determine:

• Whether Intel MKL is working on your system
• How you should call the library
• How to link the library

The examples are grouped in subdirectories mainly by Intel MKL function domains and programming languages. For example, the examples/spblas subdirectory contains a makefile to build the Sparse BLAS examples and the examples/vmlc subdirectory contains the makefile to build the C VML examples. Source code for the examples is in the next-level sources subdirectory.

See Also
High-level Directory Structure

What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Target platform

Identify the architecture of your target machine:

• IA-32 or compatible
• Intel® 64 or compatible

Reason: Because Intel MKL libraries are located in directories corresponding to your particular architecture (see Architecture Support), you should provide proper paths on your link lines (see Linking Examples). To configure your development environment for use with Intel MKL, set your environment variables using the script corresponding to your architecture (see Setting Environment Variables for details).
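One way to identify the target platform from a shell is sketched below. The function name mkl_arch_for is hypothetical, and the mapping from machine hardware names to ia32 or intel64 is an assumption of this example rather than something the guide defines. Note that uname -m reports the architecture of the build machine, which may differ from the machine you are targeting.

```shell
# Map a machine hardware name to the Intel MKL architecture directory name.
# The list of hardware names is an assumption of this sketch.
mkl_arch_for() {
    case "$1" in
        x86_64|amd64)        echo intel64 ;;
        i386|i486|i586|i686) echo ia32 ;;
        *)                   echo unknown ;;
    esac
}

# Architecture of the machine running this shell.
MKL_ARCH=$(mkl_arch_for "$(uname -m)")
echo "Library subdirectory: lib/$MKL_ARCH"
```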
Mathematical problem

Identify all Intel MKL function domains that you require:

• BLAS
• Sparse BLAS
• LAPACK
• PBLAS
• ScaLAPACK
• Sparse Solver routines
• Vector Mathematical Library functions (VML)
• Vector Statistical Library functions
• Fourier Transform functions (FFT)
• Cluster FFT
• Trigonometric Transform routines
• Poisson, Laplace, and Helmholtz Solver routines
• Optimization (Trust-Region) Solver routines
• Data Fitting Functions
• GMP* arithmetic functions (deprecated; will be removed in a future release)

Reason: The function domain you intend to use narrows the search in the Reference Manual for specific routines you need. Additionally, if you are using the Intel MKL cluster software, your link line is function-domain specific (see Working with the Cluster Software). Coding tips may also depend on the function domain (see Tips and Techniques to Improve Performance).

Programming language

Intel MKL provides support for both Fortran and C/C++ programming. Identify the language interfaces that your function domains support (see Intel® Math Kernel Library Language Interfaces Support).

Reason: Intel MKL provides language-specific include files for each function domain to simplify program development (see Language Interfaces Support, by Function Domain). For a list of language-specific interface libraries and modules and an example of how to generate them, see also Using Language-Specific Interfaces with Intel® Math Kernel Library.

Range of integer data

If your system is based on the Intel 64 architecture, identify whether your application performs calculations with large data arrays (of more than 2^31-1 elements).

Reason: To operate on large data arrays, you need to select the ILP64 interface, where integers are 64-bit; otherwise, use the default LP64 interface, where integers are 32-bit (see Using the ILP64 Interface vs. LP64 Interface).
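The integer-range decision can be expressed as a simple comparison against 2^31-1. The sketch below is illustrative only: the function name mkl_interface_for is hypothetical, and it assumes a shell whose test command handles 64-bit integers (as bash does).

```shell
# Largest element count that a 32-bit signed integer can index: 2^31 - 1.
MAX_LP64_ELEMENTS=2147483647

# Print the interface suited to an array with the given number of elements.
mkl_interface_for() {
    if [ "$1" -gt "$MAX_LP64_ELEMENTS" ]; then
        echo ilp64
    else
        echo lp64
    fi
}

# An array of three billion elements exceeds the 32-bit range.
mkl_interface_for 3000000000
```

The result applies to systems based on the Intel 64 architecture; the guide describes the LP64/ILP64 choice only for that architecture.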
Threading model

Identify whether and how your application is threaded:

• Threaded with the Intel compiler
• Threaded with a third-party compiler
• Not threaded

Reason: The compiler you use to thread your application determines which threading library you should link with your application. For applications threaded with a third-party compiler you may need to use Intel MKL in the sequential mode (for more information, see Sequential Mode of the Library and Linking with Threading Libraries).

Number of threads

Determine the number of threads you want Intel MKL to use.

Reason: Intel MKL is based on OpenMP* threading. By default, the OpenMP* software sets the number of threads that Intel MKL uses. If you need a different number, you have to set it yourself using one of the available mechanisms. For more information, see Using Parallelism of the Intel® Math Kernel Library.

Linking model

Decide which linking model is appropriate for linking your application with Intel MKL libraries:

• Static
• Dynamic

Reason: The link line syntax and libraries for static and dynamic linking are different. For the list of link libraries for static and dynamic models, linking examples, and other relevant topics, like how to save disk space by creating a custom dynamic library, see Linking Your Application with the Intel® Math Kernel Library.

MPI used

Decide what MPI you will use with the Intel MKL cluster software. You are strongly encouraged to use Intel® MPI 3.2 or later.

Reason: To link your application with ScaLAPACK and/or Cluster FFT, the libraries corresponding to your particular MPI should be listed on the link line (see Working with the Cluster Software).

3 Structure of the Intel® Math Kernel Library

Architecture Support

Intel® Math Kernel Library (Intel® MKL) for Linux* OS provides two architecture-specific implementations. The following table lists the supported architectures and directories where each architecture-specific implementation is located.

Architecture                Location
IA-32 or compatible         /lib/ia32
Intel® 64 or compatible     /lib/intel64

See Also
High-level Directory Structure
Detailed Structure of the IA-32 Architecture Directories
Detailed Structure of the Intel® 64 Architecture Directories

High-level Directory Structure

Directory                           Contents

Subdirectories of the Intel® Math Kernel Library (Intel® MKL) installation directory:
bin                                 Scripts to set environment variables in the user shell
bin/ia32                            Shell scripts for the IA-32 architecture
bin/intel64                         Shell scripts for the Intel® 64 architecture
benchmarks/linpack                  Shared-memory (SMP) version of the LINPACK benchmark
benchmarks/mp_linpack               Message-passing interface (MPI) version of the LINPACK benchmark
examples                            Examples directory. Each subdirectory has source and data files
include                             INCLUDE files for the library routines, as well as for tests and examples
include/ia32                        Fortran 95 .mod files for the IA-32 architecture and Intel® Fortran compiler
include/intel64/lp64                Fortran 95 .mod files for the Intel® 64 architecture, Intel Fortran compiler, and LP64 interface
include/intel64/ilp64               Fortran 95 .mod files for the Intel® 64 architecture, Intel Fortran compiler, and ILP64 interface
include/fftw                        Header files for the FFTW2 and FFTW3 interfaces
interfaces/blas95                   Fortran 95 interfaces to BLAS and a makefile to build the library
interfaces/fftw2x_cdft              MPI FFTW 2.x interfaces to the Intel MKL Cluster FFTs
interfaces/fftw3x_cdft              MPI FFTW 3.x interfaces to the Intel MKL Cluster FFTs
interfaces/fftw2xc                  FFTW 2.x interfaces to the Intel MKL FFTs (C interface)
interfaces/fftw2xf                  FFTW 2.x interfaces to the Intel MKL FFTs (Fortran interface)
interfaces/fftw3xc                  FFTW 3.x interfaces to the Intel MKL FFTs (C interface)
interfaces/fftw3xf                  FFTW 3.x interfaces to the Intel MKL FFTs (Fortran interface)
interfaces/lapack95                 Fortran 95 interfaces to LAPACK and a makefile to build the library
lib/ia32                            Static libraries and shared objects for the IA-32 architecture
lib/intel64                         Static libraries and shared objects for the Intel® 64 architecture
tests                               Source and data files for tests
tools                               Tools and plug-ins
tools/builder                       Tools for creating custom dynamically linkable libraries
tools/plugins/com.intel.mkl.help    Eclipse* IDE plug-in with Intel MKL Reference Manual in WebHelp format. See mkl_documentation.htm for more information

Subdirectories of
Documentation/en_US/mkl             Intel MKL documentation.
man/en_US/man3                      Man pages for Intel MKL functions. No directory for man pages is created in locales other than en_US even if a directory for the localized documentation is created in the respective locales. For more information, see Contents of the Documentation Directories.
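The directory table can be turned into a quick installation check. The sketch below prints the top-level subdirectories from the table that are missing under a given root. The function name mkl_missing_dirs is hypothetical, and using MKLROOT as the root assumes the environment scripts have already been run.

```shell
# Print each expected top-level subdirectory that is missing under the given root.
# The directory names are taken from the High-level Directory Structure table.
mkl_missing_dirs() {
    root=$1
    for d in bin include lib examples tests tools interfaces benchmarks; do
        [ -d "$root/$d" ] || echo "$d"
    done
}

# Check the directory named by MKLROOT, falling back to the current directory.
mkl_missing_dirs "${MKLROOT:-.}"
```

An empty result means that every listed subdirectory exists; it does not verify the contents of those directories.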
See Also
Notational Conventions

Layered Model Concept

Intel MKL is structured to support multiple compilers and interfaces, different OpenMP* implementations, both serial and multiple threads, and a wide range of processors. Conceptually Intel MKL can be divided into distinct parts to support different interfaces, threading models, and core computations:

1. Interface Layer
2. Threading Layer
3. Computational Layer

You can combine Intel MKL libraries to meet your needs by linking with one library in each part, layer by layer. Once the interface library is selected, the threading library you select picks up the chosen interface, and the computational library uses the interfaces and OpenMP implementation (or non-threaded mode) chosen in the first two layers.

To support threading with different compilers, one more layer is needed, which contains libraries not included in Intel MKL:

• Compiler run-time libraries (RTL).

The following table provides more details of each layer.

Layer: Interface Layer
Description:
    This layer matches compiled code of your application with the threading and/or computational parts of the library. This layer provides:

    • LP64 and ILP64 interfaces.
    • Compatibility with compilers that return function values differently.
    • A mapping between single-precision names and double-precision names for applications using Cray*-style naming (SP2DP interface).

    The SP2DP interface supports Cray-style naming in applications targeted for the Intel 64 architecture and using the ILP64 interface. It provides a mapping between single-precision names (for both real and complex types) in the application and double-precision names in Intel MKL BLAS and LAPACK. Function names are mapped as shown in the following example for BLAS functions ?GEMM:

        SGEMM -> DGEMM
        DGEMM -> DGEMM
        CGEMM -> ZGEMM
        ZGEMM -> ZGEMM

    Note that double-precision names are not changed.

Layer: Threading Layer
Description:
    This layer:

    • Provides a way to link threaded Intel MKL with different threading compilers.
    • Enables you to link with a threaded or sequential mode of the library.

    This layer is compiled for different environments (threaded or sequential) and compilers (from Intel, GNU*, and so on).

Layer: Computational Layer
Description:
    This layer is the heart of Intel MKL. It has only one library for each combination of architecture and supported OS. The Computational layer accommodates multiple architectures by identifying architecture features and choosing the appropriate binary code at run time.

Layer: Compiler Run-time Libraries (RTL)
Description:
    To support threading with Intel compilers, Intel MKL uses RTLs of the Intel® C++ Composer XE or Intel® Fortran Composer XE. To thread using third-party threading compilers, use libraries in the Threading layer or an appropriate compatibility library.

See Also
Using the ILP64 Interface vs. LP64 Interface
Linking Your Application with the Intel® Math Kernel Library
Linking with Threading Libraries

Accessing the Intel® Math Kernel Library Documentation

Contents of the Documentation Directories

Most of Intel MKL documentation is installed at /Documentation//mkl. For example, the documentation in English is installed at /Documentation/en_US/mkl. However, some Intel MKL-related documents are installed one or two levels up. The following table lists MKL-related documentation.
File name                    Comment

Files in /Documentation
/clicense or /flicense       Common end user license for the Intel® C++ Composer XE 2011 or Intel® Fortran Composer XE 2011, respectively
mklsupport.txt               Information on package number for customer support reference

Contents of /Documentation//mkl
redist.txt                   List of redistributable files
mkl_documentation.htm        Overview and links for the Intel MKL documentation
mkl_manual/index.htm         Intel MKL Reference Manual in an uncompressed HTML format
Release_Notes.htm            Intel MKL Release Notes
mkl_userguide/index.htm      Intel MKL User's Guide in an uncompressed HTML format, this document
mkl_link_line_advisor.htm    Intel MKL Link-line Advisor

Viewing Man Pages

To access Intel MKL man pages, add the man pages directory to the MANPATH environment variable. If you performed the Setting Environment Variables step of the Getting Started process, this is done automatically.

To view the man page for an Intel MKL function, enter the man command followed by the function name in your command shell. In this release, give the function name with the prefixes denoting data type, task type, or any other field that may vary for this function omitted. Examples:

• For the BLAS function ddot, enter man dot
• For the ScaLAPACK function pzgeql2, enter man pgeql2
• For the statistical function vslConvSetMode, enter man vslSetMode
• For the VML function vdPackM, enter man vPack
• For the FFT function DftiCommitDescriptor, enter man DftiCommitDescriptor

NOTE Function names in the man command are case-sensitive.

See Also
High-level Directory Structure
Setting Environment Variables

4 Linking Your Application with the Intel® Math Kernel Library

Linking Quick Start

Intel® Math Kernel Library (Intel® MKL) provides several options for quick linking of your application, which depend on the way you link:

• Using the Intel® Composer XE compiler: see Using the -mkl Compiler Option.
• Explicit dynamic linking: see Using the Single Dynamic Library for how to simplify your link line.
• Explicitly listing libraries on your link line: see Selecting Libraries to Link with for a summary of the libraries.
• Using an interactive interface: see Using the Link-line Advisor to determine libraries and options to specify on your link or compilation line.
• Using an internally provided tool: see Using the Command-line Link Tool to determine libraries, options, and environment variables or even compile and build your application.

Using the -mkl Compiler Option

The Intel® Composer XE compiler supports the following variants of the -mkl compiler option:

-mkl or -mkl=parallel    to link with standard threaded Intel MKL.
-mkl=sequential          to link with sequential version of Intel MKL.
-mkl=cluster             to link with Intel MKL cluster components (sequential) that use Intel MPI.

For more information on the -mkl compiler option, see the Intel Compiler User and Reference Guides.

On Intel® 64 architecture systems, for each variant of the -mkl option, the compiler links your application using the LP64 interface.

If you specify any variant of the -mkl compiler option, the compiler automatically includes the Intel MKL libraries.
In cases not covered by the option, use the Link-line Advisor or see Linking in Detail.

See Also
Listing Libraries on a Link Line
Using the ILP64 Interface vs. LP64 Interface
Using the Link-line Advisor
Intel® Software Documentation Library

Using the Single Dynamic Library

You can simplify your link line through the use of the Intel MKL Single Dynamic Library (SDL). To use SDL, place libmkl_rt.so on your link line. For example:

    icc application.c -lmkl_rt

SDL enables you to select the interface and threading library for Intel MKL at run time. By default, linking with SDL provides:

• LP64 interface on systems based on the Intel® 64 architecture
• Intel threading

To use other interfaces or change threading preferences, including use of the sequential version of Intel MKL, you need to specify your choices using functions or environment variables as explained in section Dynamically Selecting the Interface and Threading Layer.

Selecting Libraries to Link with

To link with Intel MKL:

• Choose one library from the Interface layer and one library from the Threading layer
• Add the single library from the Computational layer and the run-time libraries (RTL)

The following table lists Intel MKL libraries to link with your application.

                                           Interface layer         Threading layer           Computational layer    RTL
IA-32 architecture, static linking         libmkl_intel.a          libmkl_intel_thread.a     libmkl_core.a          libiomp5.so
IA-32 architecture, dynamic linking        libmkl_intel.so         libmkl_intel_thread.so    libmkl_core.so         libiomp5.so
Intel® 64 architecture, static linking     libmkl_intel_lp64.a     libmkl_intel_thread.a     libmkl_core.a          libiomp5.so
Intel® 64 architecture, dynamic linking    libmkl_intel_lp64.so    libmkl_intel_thread.so    libmkl_core.so         libiomp5.so

The Single Dynamic Library (SDL) automatically links interface, threading, and computational libraries and thus simplifies linking. The following table lists Intel MKL libraries for dynamic linking using SDL.
See Dynamically Selecting the Interface and Threading Layer for how to set the interface and threading layers at run time through function calls or environment settings.

                                     SDL             RTL
IA-32 and Intel® 64 architectures    libmkl_rt.so    libiomp5.so †

† Use the Link-line Advisor to check whether you need to explicitly link the libiomp5.so RTL.

For exceptions and alternatives to the libraries listed above, see Linking in Detail.

See Also
Layered Model Concept
Using the Link-line Advisor
Using the -mkl Compiler Option
Working with the Intel® Math Kernel Library Cluster Software

Using the Link-line Advisor

Use the Intel MKL Link-line Advisor to determine the libraries and options to specify on your link or compilation line.

The latest version of the tool is available at http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor. The tool is also available in the product.

The Advisor requests information about your system and on how you intend to use Intel MKL (link dynamically or statically, use threaded or sequential mode, etc.). The tool automatically generates the appropriate link line for your application.

See Also
Contents of the Documentation Directories

Using the Command-line Link Tool

Use the command-line Link tool provided by Intel MKL to simplify building your application with Intel MKL.

The tool not only provides the options, libraries, and environment variables to use, but also performs compilation and building of your application.

The tool mkl_link_tool is installed in the tools subdirectory of the Intel MKL installation directory.

See the knowledge base article at http://software.intel.com/en-us/articles/mkl-command-line-link-tool for more information.

Linking Examples

See Also
Using the Link-line Advisor
Examples for Linking with ScaLAPACK and Cluster FFT

Linking on IA-32 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file.
C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icc.

NOTE If you successfully completed the Setting Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.

In these examples, MKLPATH=$MKLROOT/lib/ia32, MKLINCLUDE=$MKLROOT/include:

• Static linking of myprog.f and parallel Intel MKL:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -liomp5 -lpthread

• Dynamic linking of myprog.f and parallel Intel MKL:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

• Static linking of myprog.f and sequential version of Intel MKL:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -lpthread

• Dynamic linking of myprog.f and sequential version of Intel MKL:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -lmkl_intel -lmkl_sequential -lmkl_core -lpthread

• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL (call the mkl_set_threading_layer function or set the value of the MKL_THREADING_LAYER environment variable to choose threaded or sequential mode):

      ifort myprog.f -lmkl_rt

• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_lapack95 \
          -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -liomp5 -lpthread

• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_blas95 \
          -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -liomp5 -lpthread

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Examples for Linking a C Application
Examples for Linking a Fortran Application
Using the Single Dynamic Library

Linking on Intel(R) 64 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file. C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icc.

NOTE If you successfully completed the Setting Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.

In these examples, MKLPATH=$MKLROOT/lib/intel64, MKLINCLUDE=$MKLROOT/include:

• Static linking of myprog.f and parallel Intel MKL supporting the LP64 interface:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -liomp5 -lpthread

• Dynamic linking of myprog.f and parallel Intel MKL supporting the LP64 interface:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

• Static linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -lpthread

• Dynamic linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread

• Static linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -Wl,--start-group $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -liomp5 -lpthread

• Dynamic linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
          -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL (call appropriate functions or set environment variables to choose threaded or sequential mode and to set the interface):

      ifort myprog.f -lmkl_rt

• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL supporting the LP64 interface:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/intel64/lp64 -lmkl_lapack95_lp64 \
          -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -liomp5 -lpthread

• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL supporting the LP64 interface:

      ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/intel64/lp64 -lmkl_blas95_lp64 \
          -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group \
          -liomp5 -lpthread

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Examples for Linking a C Application
Examples for Linking a Fortran Application
Using the Single Dynamic Library

Linking in Detail

This section recommends which libraries to link with depending on your Intel MKL usage scenario and provides details of the linking.

Listing Libraries on a Link Line

To link with Intel MKL, specify paths and libraries on the link line as shown below.

NOTE The syntax below is for dynamic linking. For static linking, replace each library name preceded with "-l" with the path to the library file. For example, replace -lmkl_core with $MKLPATH/libmkl_core.a, where $MKLPATH is the appropriate user-defined environment variable.
    -L -I [-I/{ia32|intel64|{ilp64|lp64}}]
    [-lmkl_blas{95|95_ilp64|95_lp64}]
    [-lmkl_lapack{95|95_ilp64|95_lp64}]
    [ ]
    -lmkl_{intel|intel_ilp64|intel_lp64|intel_sp2dp|gf|gf_ilp64|gf_lp64}
    -lmkl_{intel_thread|gnu_thread|pgi_thread|sequential}
    -lmkl_core
    -liomp5 [-lpthread] [-lm]

In case of static linking, enclose the cluster components, interface, threading, and computational libraries in grouping symbols (for example, -Wl,--start-group $MKLPATH/libmkl_cdft_core.a $MKLPATH/libmkl_blacs_intelmpi_ilp64.a $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group).

The order of listing libraries on the link line is essential, except for the libraries enclosed in the grouping symbols above.

See Also
Using the Link-line Advisor
Linking Examples
Working with the Intel® Math Kernel Library Cluster Software

Dynamically Selecting the Interface and Threading Layer

The Single Dynamic Library (SDL) enables you to dynamically select the interface and threading layer for Intel MKL.

Setting the Interface Layer

Available interfaces depend on the architecture of your system.

On systems based on the Intel® 64 architecture, LP64 and ILP64 interfaces are available. To set one of these interfaces at run time, use the mkl_set_interface_layer function or the MKL_INTERFACE_LAYER environment variable. The following table provides values to be used to set each interface.

Interface Layer    Value of MKL_INTERFACE_LAYER    Value of the Parameter of mkl_set_interface_layer
LP64               LP64                            MKL_INTERFACE_LP64
ILP64              ILP64                           MKL_INTERFACE_ILP64

If the mkl_set_interface_layer function is called, the environment variable MKL_INTERFACE_LAYER is ignored.

By default the LP64 interface is used.

See the Intel MKL Reference Manual for details of the mkl_set_interface_layer function.
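A minimal sketch of the environment-variable route follows. It validates the requested value against the two values listed in the table before exporting MKL_INTERFACE_LAYER. The function name set_mkl_interface_layer is hypothetical; only the variable name and its values LP64 and ILP64 come from the table above.

```shell
# Export MKL_INTERFACE_LAYER after checking the value against the documented ones.
set_mkl_interface_layer() {
    case "$1" in
        LP64|ILP64)
            export MKL_INTERFACE_LAYER="$1"
            ;;
        *)
            echo "unsupported interface layer: $1" >&2
            return 1
            ;;
    esac
}

# Request the ILP64 interface for programs started from this shell.
set_mkl_interface_layer ILP64
```

The setting matters only for applications linked with the Single Dynamic Library, and, as stated above, it is ignored if the application calls mkl_set_interface_layer itself.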
Setting the Threading Layer

To set the threading layer at run time, use the mkl_set_threading_layer function or the MKL_THREADING_LAYER environment variable. The following table lists available threading layers along with the values to be used to set each layer.

Threading Layer                 Value of MKL_THREADING_LAYER    Value of the Parameter of mkl_set_threading_layer
Intel threading                 INTEL                           MKL_THREADING_INTEL
Sequential mode of Intel MKL    SEQUENTIAL                      MKL_THREADING_SEQUENTIAL
GNU threading                   GNU                             MKL_THREADING_GNU
PGI threading                   PGI                             MKL_THREADING_PGI

If the mkl_set_threading_layer function is called, the environment variable MKL_THREADING_LAYER is ignored.

By default Intel threading is used.

See the Intel MKL Reference Manual for details of the mkl_set_threading_layer function.

See Also
Using the Single Dynamic Library
Layered Model Concept
Directory Structure in Detail

Linking with Interface Libraries

Using the ILP64 Interface vs. LP64 Interface

The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 2^31-1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.

The LP64 and ILP64 interfaces are implemented in the Interface layer. Link with the following interface libraries for the LP64 or ILP64 interface, respectively:

• libmkl_intel_lp64.a or libmkl_intel_ilp64.a for static linking
• libmkl_intel_lp64.so or libmkl_intel_ilp64.so for dynamic linking

The ILP64 interface provides for the following:

• Support for large data arrays (with more than 2^31-1 elements)
• Compiling your Fortran code with the -i8 compiler option

The LP64 interface provides compatibility with the previous Intel MKL versions because "LP64" is just a new name for the only interface that the Intel MKL versions lower than 9.1 provided.
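To show how the interface library name fits into a complete link line, the sketch below assembles the dynamic-linking flags for the Intel® 64 architecture, following the pattern of the examples in Linking on Intel(R) 64 Architecture Systems. The function name mkl_link_flags and its two arguments are hypothetical. The combination of ILP64 with sequential mode is inferred from the general link-line syntax rather than taken from a listed example, so confirm the result with the Link-line Advisor.

```shell
# Print dynamic-linking flags for the Intel 64 architecture.
# $1: interface (lp64 or ilp64); $2: mode (parallel or sequential).
mkl_link_flags() {
    case "$1" in
        lp64|ilp64) iface="-lmkl_intel_$1" ;;
        *) return 1 ;;
    esac
    case "$2" in
        parallel)   echo "$iface -lmkl_intel_thread -lmkl_core -liomp5 -lpthread" ;;
        sequential) echo "$iface -lmkl_sequential -lmkl_core -lpthread" ;;
        *) return 1 ;;
    esac
}

# Flags for the ILP64 interface with Intel threading.
mkl_link_flags ilp64 parallel
```

The output can be appended to a command such as ifort myprog.f -L$MKLPATH -I$MKLINCLUDE, with -i8 added when compiling Fortran code for ILP64.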
Choose the ILP64 interface if your application uses Intel MKL for calculations with large data arrays or may do so in the future.

Intel MKL provides the same include directory for the ILP64 and LP64 interfaces.

Compiling for LP64/ILP64

The table below shows how to compile for the ILP64 and LP64 interfaces:

Fortran
  Compiling for ILP64    ifort -i8 -I/include ...
  Compiling for LP64     ifort -I/include ...
C or C++
  Compiling for ILP64    icc -DMKL_ILP64 -I/include ...
  Compiling for LP64     icc -I/include ...

CAUTION Linking of an application compiled with the -i8 or -DMKL_ILP64 option to the LP64 libraries may result in unpredictable consequences and erroneous output.

Coding for ILP64

You do not need to change existing code if you are not using the ILP64 interface.

To migrate to ILP64 or write new code for ILP64, use appropriate types for parameters of the Intel MKL functions and subroutines:

Integer Types                                                             Fortran                            C or C++
32-bit integers                                                           INTEGER*4 or INTEGER(KIND=4)       int
Universal integers for ILP64/LP64 (64-bit for ILP64, 32-bit otherwise)    INTEGER without specifying KIND    MKL_INT
Universal integers for ILP64/LP64 (64-bit integers)                       INTEGER*8 or INTEGER(KIND=8)       MKL_INT64
FFT interface integers for ILP64/LP64                                     INTEGER without specifying KIND    MKL_LONG

To determine the type of an integer parameter of a function, use appropriate include files. For functions that support only a Fortran interface, use the C/C++ include files *.h.

The above table explains which integer parameters of functions become 64-bit and which remain 32-bit for ILP64. The table applies to most Intel MKL functions except some VML and VSL functions, which require integer parameters to be 64-bit or 32-bit regardless of the interface:
• VML: The mode parameter of VML functions is 64-bit.
• Random Number Generators (RNG): All discrete RNG except viRngUniformBits64 are 32-bit.
The viRngUniformBits64 generator function and vslSkipAheadStream service function are 64-bit.
• Summary Statistics: The estimate parameter of the vslsSSCompute/vsldSSCompute function is 64-bit.

Refer to the Intel MKL Reference Manual for more information.

To better understand ILP64 interface details, see also examples and tests.

Limitations

All Intel MKL function domains support ILP64 programming with the following exceptions:
• FFTW interfaces to Intel MKL:
  • FFTW 2.x wrappers do not support ILP64.
  • FFTW 3.2 wrappers support ILP64 by a dedicated set of functions plan_guru64.
• GMP* Arithmetic Functions do not support ILP64.

NOTE GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
High-level Directory Structure
Include Files
Language Interfaces Support, by Function Domain
Layered Model Concept
Directory Structure in Detail

Linking with Fortran 95 Interface Libraries

The libmkl_blas95*.a and libmkl_lapack95*.a libraries contain Fortran 95 interfaces for BLAS and LAPACK, respectively, which are compiler-dependent. In the Intel MKL package, they are prebuilt for the Intel® Fortran compiler. If you are using a different compiler, build these libraries before using the interface.

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Compiler-dependent Functions and Fortran 90 Modules

Linking with Threading Libraries

Sequential Mode of the Library

You can use Intel MKL in a sequential (non-threaded) mode. In this mode, Intel MKL runs unthreaded code. However, it is thread-safe (except the LAPACK deprecated routine ?lacon), which means that you can use it in a parallel region in your OpenMP* code. The sequential mode requires no compatibility OpenMP* run-time library and does not respond to the environment variable OMP_NUM_THREADS or its Intel MKL equivalents.

You should use the library in the sequential mode only if you have a particular reason not to use Intel MKL threading.
The sequential mode may be helpful when using Intel MKL with programs threaded with some non-Intel compilers or in other situations where you need a non-threaded version of the library (for instance, in some MPI cases). To set the sequential mode, in the Threading layer, choose the *sequential.* library.

Add the POSIX threads library (pthread) to your link line for the sequential mode because the *sequential.* library depends on pthread.

See Also
Directory Structure in Detail
Using Parallelism of the Intel® Math Kernel Library
Avoiding Conflicts in the Execution Environment
Linking Examples

Selecting the Threading Layer

Several compilers that Intel MKL supports use the OpenMP* threading technology. Intel MKL supports implementations of the OpenMP* technology that these compilers provide. To make use of this support, you need to link with the appropriate library in the Threading Layer and Compiler Support Run-time Library (RTL).

Threading Layer

Each Intel MKL threading library contains the same code compiled by the respective compiler (Intel, gnu and PGI* compilers on Linux OS).

RTL

This layer includes libiomp, the compatibility OpenMP* run-time library of the Intel compiler. In addition to the Intel compiler, libiomp provides support for one more threading compiler on Linux OS (GNU). That is, a program threaded with a GNU compiler can safely be linked with Intel MKL and libiomp.

The table below helps explain what threading library and RTL you should choose under different scenarios when using Intel MKL (static cases only):

Compiler  Application Threaded?  Threading Layer                             RTL Recommended                             Comment
Intel     Does not matter        libmkl_intel_thread.a                       libiomp5.so
PGI       Yes                    libmkl_pgi_thread.a or libmkl_sequential.a  PGI* supplied                               Use of libmkl_sequential.a removes threading from Intel MKL calls.
PGI       No                     libmkl_intel_thread.a                       libiomp5.so
PGI       No                     libmkl_pgi_thread.a                         PGI* supplied
PGI       No                     libmkl_sequential.a                         None
gnu       Yes                    libmkl_gnu_thread.a                         libiomp5.so or GNU OpenMP run-time library  libiomp5 offers superior scaling performance.
gnu       Yes                    libmkl_sequential.a                         None
gnu       No                     libmkl_intel_thread.a                       libiomp5.so
other     Yes                    libmkl_sequential.a                         None
other     No                     libmkl_intel_thread.a                       libiomp5.so

Linking with Computational Libraries

If you are not using the Intel MKL cluster software, you need to link your application with only one computational library, depending on the linking method:

Static Linking    Dynamic Linking
libmkl_core.a     libmkl_core.so

Computational Libraries for Applications that Use the Intel MKL Cluster Software

ScaLAPACK and Cluster Fourier Transform Functions (Cluster FFT) require more computational libraries, which may depend on your architecture.

The following table lists computational libraries for IA-32 architecture applications that use ScaLAPACK or Cluster FFT.

Computational Libraries for IA-32 Architecture

Function domain                        Static Linking                          Dynamic Linking
ScaLAPACK †                            libmkl_scalapack_core.a libmkl_core.a   libmkl_scalapack_core.so libmkl_core.so
Cluster Fourier Transform Functions †  libmkl_cdft_core.a libmkl_core.a        libmkl_cdft_core.so libmkl_core.so

† Also add the library with BLACS routines corresponding to the MPI used.

The following table lists computational libraries for Intel® 64 architecture applications that use ScaLAPACK or Cluster FFT.
Computational Libraries for the Intel® 64 Architecture

Function domain                        Static Linking                           Dynamic Linking
ScaLAPACK, LP64 interface †            libmkl_scalapack_lp64.a libmkl_core.a    libmkl_scalapack_lp64.so libmkl_core.so
ScaLAPACK, ILP64 interface †           libmkl_scalapack_ilp64.a libmkl_core.a   libmkl_scalapack_ilp64.so libmkl_core.so
Cluster Fourier Transform Functions †  libmkl_cdft_core.a libmkl_core.a         libmkl_cdft_core.so libmkl_core.so

† Also add the library with BLACS routines corresponding to the MPI used.

See Also
Linking with ScaLAPACK and Cluster FFTs
Using the Link-line Advisor
Using the ILP64 Interface vs. LP64 Interface

Linking with Compiler Run-time Libraries

Dynamically link libiomp, the compatibility OpenMP* run-time library, even if you link other libraries statically.

Linking to libiomp statically can be problematic because the more complex your operating environment or application, the more likely redundant copies of the library are included. This may result in performance issues (oversubscription of threads) and even incorrect results.

To link libiomp dynamically, be sure the LD_LIBRARY_PATH environment variable is defined correctly.

See Also
Scripts to Set Environment Variables
Layered Model Concept

Linking with System Libraries

To use the Intel MKL FFT, Trigonometric Transform, or Poisson, Laplace, and Helmholtz Solver routines, link in the math support system library by adding "-lm" to the link line.

On Linux OS, the libiomp library relies on the native pthread library for multi-threading. Any time libiomp is required, add -lpthread to your link line afterwards (the order of listing libraries is important).

Building Custom Shared Objects

Custom shared objects reduce the collection of functions available in Intel MKL libraries to those required to solve your particular problems, which helps to save disk space and build your own dynamic libraries for distribution.
The Intel MKL custom shared object builder, located in the tools/builder directory, enables you to create a dynamic library (shared object) containing the selected functions. The builder contains a makefile and a definition file with the list of functions.

NOTE The objects in Intel MKL static libraries are position-independent code (PIC), which is not typical for static libraries. Therefore, the custom shared object builder can create a shared object from a subset of Intel MKL functions by picking the respective object files from the static libraries.

Using the Custom Shared Object Builder

To build a custom shared object, use the following command:

make target []

The following table lists possible values of target and explains what the command does for each value:

Value       Comment
libia32     The builder uses static Intel MKL interface, threading, and core libraries to build a custom shared object for the IA-32 architecture.
libintel64  The builder uses static Intel MKL interface, threading, and core libraries to build a custom shared object for the Intel® 64 architecture.
soia32      The builder uses the single dynamic library libmkl_rt.so to build a custom shared object for the IA-32 architecture.
sointel64   The builder uses the single dynamic library libmkl_rt.so to build a custom shared object for the Intel® 64 architecture.
help        The command prints Help on the custom shared object builder.

The placeholder stands for the list of parameters that define macros to be used by the makefile. The following table describes these parameters:

Parameter [Values]                 Description
interface = {lp64|ilp64}           Defines whether to use the LP64 or ILP64 programming interface for the Intel 64 architecture. The default value is lp64.
threading = {parallel|sequential}  Defines whether to use Intel MKL in the threaded or sequential mode. The default value is parallel.
export =                           Specifies the full name of the file that contains the list of entry-point functions to be included in the shared object. The default name is user_example_list (no extension).
name =                             Specifies the name of the library to be created. By default, the name of the created library is mkl_custom.so.
xerbla =                           Specifies the name of the object file .o that contains the user's error handler. The makefile adds this error handler to the library for use instead of the default Intel MKL error handler xerbla. If you omit this parameter, the native Intel MKL xerbla is used. See the description of the xerbla function in the Intel MKL Reference Manual on how to develop your own error handler.
MKLROOT =                          Specifies the location of Intel MKL libraries used to build the custom shared object. By default, the builder uses the Intel MKL installation directory.

All the above parameters are optional.

In the simplest case, the command line is make ia32, and the missing options have default values. This command creates the mkl_custom.so library for processors using the IA-32 architecture. The command takes the list of functions from the user_example_list file and uses the native Intel MKL error handler xerbla.

An example of a more complex case follows:

make ia32 export=my_func_list.txt name=mkl_small xerbla=my_xerbla.o

In this case, the command creates the mkl_small.so library for processors using the IA-32 architecture. The command takes the list of functions from the my_func_list.txt file and uses the user's error handler my_xerbla.o.

The process is similar for processors using the Intel® 64 architecture.

See Also
Using the Single Dynamic Library

Composing a List of Functions

To compose a list of functions for a minimal custom shared object needed for your application, you can use the following procedure:
1. Link your application with installed Intel MKL libraries to make sure the application builds.
2. Remove all Intel MKL libraries from the link line and start linking.
Unresolved symbols indicate Intel MKL functions that your application uses.
3. Include these functions in the list.

Important Each time your application starts using more Intel MKL functions, update the list to include the new functions.

See Also
Specifying Function Names

Specifying Function Names

In the file with the list of functions for your custom shared object, adjust function names to the required interface. For example, for Fortran functions append an underscore character "_" to the names as a suffix:

dgemm_
ddot_
dgetrf_

For more examples, see domain-specific lists of functions in the /tools/builder folder.

NOTE The lists of functions are provided in the /tools/builder folder merely as examples. See Composing a List of Functions for how to compose lists of functions for your custom shared object.

TIP Names of Fortran-style routines (BLAS, LAPACK, etc.) can be either upper-case or lower-case, with or without the trailing underscore. For example, these names are equivalent:
BLAS: dgemm, DGEMM, dgemm_, DGEMM_
LAPACK: dgetrf, DGETRF, dgetrf_, DGETRF_

Properly capitalize names of C support functions in the function list. To do this, follow the guidelines below:
1. In the mkl_service.h include file, look up a #define directive for your function.
2. Take the function name from the replacement part of that directive.

For example, the #define directive for the mkl_disable_fast_mm function is

#define mkl_disable_fast_mm MKL_Disable_Fast_MM

Capitalize the name of this function in the list like this: MKL_Disable_Fast_MM.

For the names of the Fortran support functions, see the tip.

NOTE If selected functions have several processor-specific versions, the builder automatically includes them all in the custom library and the dispatcher manages them.

Distributing Your Custom Shared Object

To enable use of your custom shared object in a threaded mode, distribute libiomp5.so along with the custom shared object.
Managing Performance and Memory

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Using Parallelism of the Intel® Math Kernel Library

Intel MKL is extensively parallelized. See Threaded Functions and Problems for lists of threaded functions and problems that can be threaded.

Intel MKL is thread-safe, which means that all Intel MKL functions (except the LAPACK deprecated routine ?lacon) work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access for multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Therefore, you can call Intel MKL from multiple threads and not worry about the function instances interfering with each other.

The library uses OpenMP* threading software, so you can use the environment variable OMP_NUM_THREADS to specify the number of threads or the equivalent OpenMP run-time function calls. Intel MKL also offers variables that are independent of OpenMP, such as MKL_NUM_THREADS, and equivalent Intel MKL functions for thread management.
The Intel MKL variables are always inspected first, then the OpenMP variables are examined, and if neither is used, the OpenMP software chooses the default number of threads.

By default, Intel MKL uses the number of threads equal to the number of physical cores on the system.

To achieve higher performance, set the number of threads to the number of real processors or physical cores, as summarized in Techniques to Set the Number of Threads.

See Also
Managing Multi-core Performance

Threaded Functions and Problems

The following Intel MKL function domains are threaded:
• Direct sparse solver.
• LAPACK. For the list of threaded routines, see Threaded LAPACK Routines.
• Level1 and Level2 BLAS. For the list of threaded routines, see Threaded BLAS Level1 and Level2 Routines.
• All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
• All mathematical VML functions.
• FFT. For the list of FFT transforms that can be threaded, see Threaded FFT Problems.

Threaded LAPACK Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.

The following LAPACK routines are threaded:
• Linear equations, computational routines:
  • Factorization: ?getrf, ?gbtrf, ?potrf, ?pptrf, ?sytrf, ?hetrf, ?sptrf, ?hptrf
  • Solving: ?dttrsb, ?gbtrs, ?gttrs, ?pptrs, ?pbtrs, ?pttrs, ?sytrs, ?sptrs, ?hptrs, ?tptrs, ?tbtrs
• Orthogonal factorization, computational routines: ?geqrf, ?ormqr, ?unmqr, ?ormlq, ?unmlq, ?ormql, ?unmql, ?ormrq, ?unmrq
• Singular Value Decomposition, computational routines: ?gebrd, ?bdsqr
• Symmetric Eigenvalue Problems, computational routines: ?sytrd, ?hetrd, ?sptrd, ?hptrd, ?steqr, ?stedc.
• Generalized Nonsymmetric Eigenvalue Problems, computational routines: chgeqz/zhgeqz.
A number of other LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective use of parallelism: ?gesv, ?posv, ?gels, ?gesvd, ?syev, ?heev, cgegs/zgegs, cgegv/zgegv, cgges/zgges, cggesx/zggesx, cggev/zggev, cggevx/zggevx, and so on.

Threaded BLAS Level1 and Level2 Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.

The following routines are threaded for Intel® Core™2 Duo and Intel® Core™ i7 processors:
• Level1 BLAS: ?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level2 BLAS: ?gemv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv

Threaded FFT Problems

The following characteristics of a specific problem determine whether your FFT computation may be threaded:
• rank
• domain
• size/length
• precision (single or double)
• placement (in-place or out-of-place)
• strides
• number of transforms
• layout (for example, interleaved or split layout of complex data)

Most FFT problems are threaded. In particular, computation of multiple transforms in one call (number of transforms > 1) is threaded. Details of which transforms are threaded follow.

One-dimensional (1D) transforms

1D transforms are threaded in many cases.

1D complex-to-complex (c2c) transforms of size N using interleaved complex data layout are threaded under the following conditions depending on the architecture:

Architecture  Conditions
Intel® 64     N is a power of 2, log2(N) > 9, the transform is double-precision out-of-place, and input/output strides equal 1.
IA-32         N is a power of 2, log2(N) > 13, and the transform is single-precision.
              N is a power of 2, log2(N) > 14, and the transform is double-precision.
Any           N is composite, log2(N) > 16, and input/output strides equal 1.

1D real-to-complex and complex-to-real transforms are not threaded.

1D complex-to-complex transforms using split-complex layout are not threaded.
Prime-size complex-to-complex 1D transforms are not threaded.

Multidimensional transforms

All multidimensional transforms on large-volume data are threaded.

Avoiding Conflicts in the Execution Environment

Certain situations can cause conflicts in the execution environment that make the use of threads in Intel MKL problematic. This section briefly discusses why these problems exist and how to avoid them.

If you thread the program using OpenMP directives and compile the program with Intel compilers, Intel MKL and the program will both use the same threading library. Intel MKL tries to determine if it is in a parallel region in the program, and if it is, it does not spread its operations over multiple threads unless you specifically request Intel MKL to do so via the MKL_DYNAMIC functionality. However, Intel MKL can be aware that it is in a parallel region only if the threaded program and Intel MKL are using the same threading library. If your program is threaded by some other means, Intel MKL may operate in multithreaded mode, and the performance may suffer due to overuse of the resources.

The following table considers several cases where the conflicts may arise and provides recommendations depending on your threading model:

Threading model: You thread the program using OS threads (pthreads on Linux* OS).
Discussion: If more than one thread calls Intel MKL, and the function being called is threaded, it may be important that you turn off Intel MKL threading. Set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

Threading model: You thread the program using OpenMP directives and/or pragmas and compile the program using a compiler other than a compiler from Intel.
Discussion: This is more problematic because setting of the OMP_NUM_THREADS environment variable affects both the compiler's threading library and libiomp.
In this case, choose the Intel MKL threading layer library that matches the OpenMP compiler you employ (see Linking Examples on how to do this). If this is not possible, use Intel MKL in the sequential mode. To do this, you should link with the appropriate threading library: libmkl_sequential.a or libmkl_sequential.so (see High-level Directory Structure).

Threading model: There are multiple programs running on a multiple-CPU system, for example, a parallelized program that runs using MPI for communication in which each processor is treated as a node.
Discussion: The threading software will see multiple processors on the system even though each processor has a separate MPI process running on it. In this case, one of the solutions is to set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads). Section Intel(R) Optimized MP LINPACK Benchmark for Clusters discusses another solution for a Hybrid (OpenMP* + MPI) mode.

See Also
Using Additional Threading Control
Linking with Compiler Run-time Libraries

Techniques to Set the Number of Threads

Use one of the following techniques to change the number of threads to use in Intel MKL:
• Set one of the OpenMP or Intel MKL environment variables:
  • OMP_NUM_THREADS
  • MKL_NUM_THREADS
  • MKL_DOMAIN_NUM_THREADS
• Call one of the OpenMP or Intel MKL functions:
  • omp_set_num_threads()
  • mkl_set_num_threads()
  • mkl_domain_set_num_threads()

When choosing the appropriate technique, take into account the following rules:
• The Intel MKL threading controls take precedence over the OpenMP controls because they are inspected first.
• A function call takes precedence over any environment variables. The exception, which is a consequence of the previous rule, is the OpenMP subroutine omp_set_num_threads(), which does not have precedence over Intel MKL environment variables, such as MKL_NUM_THREADS. See Using Additional Threading Control for more details.
• You cannot change run-time behavior in the course of the run using the environment variables because they are read only once at the first call to Intel MKL.

Setting the Number of Threads Using an OpenMP* Environment Variable

You can set the number of threads using the environment variable OMP_NUM_THREADS. To change the number of threads, use the appropriate command in the command shell in which the program is going to run, for example:
• For the bash shell, enter: export OMP_NUM_THREADS=
• For the csh or tcsh shell, enter: setenv OMP_NUM_THREADS

See Also
Using Additional Threading Control

Changing the Number of Threads at Run Time

You cannot change the number of threads during run time using environment variables. However, you can call OpenMP API functions from your program to change the number of threads during run time. The following sample code shows how to change the number of threads during run time using the omp_set_num_threads() routine. See also Techniques to Set the Number of Threads.

The following example shows both C and Fortran code examples. To run this example in the C language, use the omp.h header file from the Intel(R) compiler package. If you do not have the Intel compiler but wish to explore the functionality in the example, use the Fortran API for omp_set_num_threads() rather than the C version. For example, omp_set_num_threads_( &i_one );

    // ******* C language *******
    #include "omp.h"
    #include "mkl.h"
    #include
    #define SIZE 1000
    int main(int args, char *argv[]){
        double *a, *b, *c;
        a = (double*)malloc(sizeof(double)*SIZE*SIZE);
        b = (double*)malloc(sizeof(double)*SIZE*SIZE);
        c = (double*)malloc(sizeof(double)*SIZE*SIZE);
        double alpha=1, beta=1;
        int m=SIZE, n=SIZE, k=SIZE, lda=SIZE, ldb=SIZE, ldc=SIZE, i=0, j=0;
        char transa='n', transb='n';
        for( i=0; i

    #include
    ...
    mkl_set_num_threads ( 1 );

    // ******* Fortran language *******
    ...
    call mkl_set_num_threads( 1 )

See the Intel MKL Reference Manual for the detailed description of the threading control functions, their parameters, calling syntax, and more code examples.

MKL_DYNAMIC

The MKL_DYNAMIC environment variable enables Intel MKL to dynamically change the number of threads.

The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.

When MKL_DYNAMIC is TRUE, Intel MKL tries to use what it considers the best number of threads, up to the maximum number you specify.

For example, MKL_DYNAMIC set to TRUE enables optimal choice of the number of threads in the following cases:
• If the requested number of threads exceeds the number of physical cores (perhaps because of using the Intel® Hyper-Threading Technology), and MKL_DYNAMIC is not changed from its default value of TRUE, Intel MKL will scale down the number of threads to the number of physical cores.
• If you are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.

When MKL_DYNAMIC is FALSE, Intel MKL tries not to deviate from the number of threads the user requested. However, setting MKL_DYNAMIC=FALSE does not ensure that Intel MKL will use the number of threads that you request. The library may have no choice on this number for such reasons as system resources. Additionally, the library may examine the problem and use a different number of threads than the value suggested. For example, if you attempt to do a size one matrix-matrix multiply across eight threads, the library may instead choose to use only one thread because it is impractical to use eight threads in this event.

Note also that if Intel MKL is called in a parallel region, it will use only one thread by default.
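The MKL_DYNAMIC rules above can be summarized as a small decision function. The C sketch below is a deliberately simplified model written for illustration: the function modeled_thread_count is hypothetical, it is not how Intel MKL is implemented, and it ignores the problem-size and system-resource adjustments that the library may apply even when MKL_DYNAMIC is FALSE.

```c
#include <assert.h>

/* Simplified, hypothetical model of the MKL_DYNAMIC description above.
 *   requested          - number of threads the user asked for
 *   physical_cores     - number of physical cores on the system
 *   dynamic            - nonzero when MKL_DYNAMIC is TRUE (the default)
 *   in_parallel_region - nonzero when the call is made from a parallel region
 * Not modeled: the library may still pick a different number for reasons
 * such as system resources or a very small problem size. */
int modeled_thread_count(int requested, int physical_cores,
                         int dynamic, int in_parallel_region)
{
    assert(requested >= 1 && physical_cores >= 1);

    if (dynamic && in_parallel_region)
        return 1;                 /* default behavior inside a parallel region */

    if (dynamic && requested > physical_cores)
        return physical_cores;    /* scale down to the physical cores */

    return requested;             /* otherwise try not to deviate */
}
```

Under this model, a request for 8 threads on a 4-core system yields 4 threads with MKL_DYNAMIC left at TRUE and 8 threads with MKL_DYNAMIC set to FALSE.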
If you want the library to use nested parallelism, and the thread within a parallel region is compiled with the same OpenMP compiler as Intel MKL is using, you may experiment with setting MKL_DYNAMIC to FALSE and manually increasing the number of threads.

In general, set MKL_DYNAMIC to FALSE only under circumstances that Intel MKL is unable to detect, for example, to use nested parallelism where the library is already called from a parallel section.

MKL_DOMAIN_NUM_THREADS

The MKL_DOMAIN_NUM_THREADS environment variable suggests the number of threads for a particular function domain.

MKL_DOMAIN_NUM_THREADS accepts a string value , which must have the following format:

::= { }
::= [ * ] ( | | | ) [ * ]
::=
::= MKL_DOMAIN_ALL | MKL_DOMAIN_BLAS | MKL_DOMAIN_FFT | MKL_DOMAIN_VML | MKL_DOMAIN_PARDISO
::= [ * ] ( | | ) [ * ]
::=
::= | |

In the syntax above, values of indicate function domains as follows:

MKL_DOMAIN_ALL      All function domains
MKL_DOMAIN_BLAS     BLAS Routines
MKL_DOMAIN_FFT      non-cluster Fourier Transform Functions
MKL_DOMAIN_VML      Vector Mathematical Functions
MKL_DOMAIN_PARDISO  PARDISO

For example,

MKL_DOMAIN_ALL 2 : MKL_DOMAIN_BLAS 1 : MKL_DOMAIN_FFT 4
MKL_DOMAIN_ALL=2 : MKL_DOMAIN_BLAS=1 : MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL=2, MKL_DOMAIN_BLAS=1, MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL=2; MKL_DOMAIN_BLAS=1; MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL = 2 MKL_DOMAIN_BLAS 1 , MKL_DOMAIN_FFT 4
MKL_DOMAIN_ALL,2: MKL_DOMAIN_BLAS 1, MKL_DOMAIN_FFT,4 .

The global variables MKL_DOMAIN_ALL, MKL_DOMAIN_BLAS, MKL_DOMAIN_FFT, MKL_DOMAIN_VML, and MKL_DOMAIN_PARDISO, as well as the interface for the Intel MKL threading control functions, can be found in the mkl.h header file.

The table below illustrates how values of MKL_DOMAIN_NUM_THREADS are interpreted.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_ALL=4
Interpretation: All parts of Intel MKL should try four threads.
The actual number of threads may still be different because of the MKL_DYNAMIC setting or system resource issues. The setting is equivalent to MKL_NUM_THREADS = 4.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4
Interpretation: All parts of Intel MKL should try one thread, except for BLAS, which is suggested to try four threads.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_VML=2
Interpretation: VML should try two threads. The setting affects no other part of Intel MKL.

Be aware that the domain-specific settings take precedence over the overall ones. For example, the "MKL_DOMAIN_BLAS=4" value of MKL_DOMAIN_NUM_THREADS suggests trying four threads for BLAS, regardless of later setting MKL_NUM_THREADS, and a function call "mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );" suggests the same, regardless of later calls to mkl_set_num_threads().

However, a function call with input "MKL_DOMAIN_ALL", such as "mkl_domain_set_num_threads (4, MKL_DOMAIN_ALL);" is equivalent to "mkl_set_num_threads(4)", and thus it will be overwritten by later calls to mkl_set_num_threads. Similarly, the environment setting of MKL_DOMAIN_NUM_THREADS with "MKL_DOMAIN_ALL=4" will be overwritten with MKL_NUM_THREADS = 2.

Whereas the MKL_DOMAIN_NUM_THREADS environment variable enables you to set several variables at once, for example, "MKL_DOMAIN_BLAS=4,MKL_DOMAIN_FFT=2", the corresponding function does not take string syntax. So, to do the same with the function calls, you may need to make several calls, which in this example are as follows:

mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );
mkl_domain_set_num_threads ( 2, MKL_DOMAIN_FFT );

Setting the Environment Variables for Threading Control

To set the environment variables used for threading control, in the command shell in which the program is going to run, enter the export or setenv commands, depending on the shell you use.
For example, for a bash shell, use the export commands:

    export <VARIABLE NAME>=<value>

For example:

    export MKL_NUM_THREADS=4
    export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
    export MKL_DYNAMIC=FALSE

For the csh or tcsh shell, use the setenv commands, because in these shells the set command defines only a shell variable that child processes do not inherit:

    setenv <VARIABLE NAME> <value>

For example:

    setenv MKL_NUM_THREADS 4
    setenv MKL_DOMAIN_NUM_THREADS "MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
    setenv MKL_DYNAMIC FALSE

Tips and Techniques to Improve Performance

Coding Techniques

To obtain the best performance with Intel MKL, ensure the following data alignment in your source code:

• Align arrays on 16-byte boundaries. See Aligning Addresses on 16-byte Boundaries for how to do it.
• Make sure leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16, where element_size is the size of an array element in bytes.
• For two-dimensional arrays, avoid leading dimension values divisible by 2048 bytes. For example, for a double-precision array, with element_size = 8, avoid leading dimensions 256, 512, 768, 1024, … (elements).

LAPACK Packed Routines

The routines with names that contain the letters HP, OP, PP, SP, TP, UP in the matrix type and storage position (the second and third letters, respectively) operate on matrices in the packed format (see the LAPACK "Routine Naming Conventions" sections in the Intel MKL Reference Manual). Their functionality is strictly equivalent to the functionality of the unpacked routines with names containing the letters HE, OR, PO, SY, TR, UN in the same positions, but the performance is significantly lower.

If the memory restriction is not too tight, use an unpacked routine for better performance. In this case, you need to allocate about N^2/2 more elements than the respective packed routine requires, where N is the problem size (the number of equations).
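The memory trade-off between packed and unpacked storage can be estimated before choosing a routine. The following is a minimal sketch, not part of Intel MKL: the helper names packed_elems, full_elems, and extra_elems are hypothetical, and the full-storage count assumes a leading dimension equal to N.

```c
#include <stddef.h>

/* Hypothetical helper: elements held by packed storage of an
   N-by-N symmetric or triangular matrix, N*(N+1)/2. */
size_t packed_elems(size_t n)
{
    return n * (n + 1) / 2;
}

/* Hypothetical helper: elements held by full (unpacked) storage,
   assuming the leading dimension equals N. */
size_t full_elems(size_t n)
{
    return n * n;
}

/* Extra elements needed by the unpacked layout: N*(N-1)/2,
   which is close to N^2/2 for large N. */
size_t extra_elems(size_t n)
{
    return full_elems(n) - packed_elems(n);
}
```

Multiply the result by the element size (8 bytes for double precision) to get the extra memory in bytes. For N = 1000 in double precision, extra_elems gives 499500 elements, or roughly 4 MB.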
For example, to speed up solving a symmetric eigenproblem with an expert driver, use the unpacked routine:

    call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info)

where a is the dimension lda-by-n, which is at least N^2 elements, instead of the packed routine:

    call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)

where ap is the dimension N*(N+1)/2.

FFT Functions

Additional conditions can improve performance of the FFT functions. The addresses of the first elements of arrays and the leading dimension values, in bytes (n*element_size), of two-dimensional arrays should be divisible by the cache line size, which equals:

• 32 bytes for the Intel® Pentium® III processors
• 64 bytes for the Intel® Pentium® 4 processors and processors using Intel® 64 architecture

Hardware Configuration Tips

Dual-Core Intel® Xeon® processor 5100 series systems

To get the best performance with Intel MKL on Dual-Core Intel® Xeon® processor 5100 series systems, enable the Hardware DPL (streaming data) Prefetcher functionality of this processor. To configure this functionality, use the appropriate BIOS settings, as described in your BIOS documentation.

Intel® Hyper-Threading Technology

Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.

If you run with Intel HT Technology enabled, performance may be especially impacted if you run on fewer threads than physical cores.
Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation. For Intel MKL, apply the following setting (shown in bash syntax):

    export KMP_AFFINITY=granularity=fine,compact,1,0

See Also
Using Parallelism of the Intel® Math Kernel Library

Managing Multi-core Performance

You can obtain best performance on systems with multi-core processors by requiring that threads do not migrate from core to core. To do this, bind threads to the CPU cores by setting an affinity mask to threads. Use one of the following options:

• OpenMP facilities (recommended, if available), for example, the KMP_AFFINITY environment variable using the Intel OpenMP library
• A system function, as explained below

Consider the following performance issue:

• The system has two sockets with two cores each, for a total of four cores (CPUs)
• The two-thread parallel application that calls the Intel MKL FFT happens to run faster than in four threads, but the performance in two threads is very unstable

The following code example shows how to resolve this issue by setting an affinity mask by operating system means using the Intel compiler. The code calls the system function sched_setaffinity to bind the threads to the cores on different sockets. Then the Intel MKL FFT function is called:

    #define _GNU_SOURCE //for using the GNU CPU affinity
    // (works with the appropriate kernel and glibc)
    // Set affinity mask
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <omp.h>
    int main(void) {
        int NCPUs = sysconf(_SC_NPROCESSORS_CONF);
        printf("Using thread affinity on %i NCPUs\n", NCPUs);
    #pragma omp parallel default(shared)
        {
            cpu_set_t new_mask;
            cpu_set_t was_mask;
            int tid = omp_get_thread_num();
            CPU_ZERO(&new_mask);
            // 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
            CPU_SET(tid==0 ? 0 : 2, &new_mask);
            if (sched_getaffinity(0, sizeof(was_mask), &was_mask) == -1) {
                printf("Error: sched_getaffinity(%d, sizeof(was_mask), &was_mask)\n", tid);
            }
            if (sched_setaffinity(0, sizeof(new_mask), &new_mask) == -1) {
                printf("Error: sched_setaffinity(%d, sizeof(new_mask), &new_mask)\n", tid);
            }
            printf("tid=%d new_mask=%08X was_mask=%08X\n", tid,
                   *(unsigned int*)(&new_mask), *(unsigned int*)(&was_mask));
        }
        // Call Intel MKL FFT function
        return 0;
    }

Compile the application with the Intel compiler using the following command:

    icc test_application.c -openmp

where test_application.c is the filename for the application.

Build the application. Run it in two threads, for example, by using the environment variable to set the number of threads:

    env OMP_NUM_THREADS=2 ./a.out

See the Linux Programmer's Manual (in man pages format) for particulars of the sched_setaffinity function used in the above example.

Operating on Denormals

The IEEE 754-2008 standard, "IEEE Standard for Floating-Point Arithmetic", defines denormal (or subnormal) numbers as non-zero numbers smaller than the smallest possible normalized numbers for a specific floating-point format. Floating-point operations on denormals are slower than on normalized operands because denormal operands and results are usually handled through a software assist mechanism rather than directly in hardware. This software processing causes Intel MKL functions that consume denormals to run slower than with normalized floating-point numbers.

You can mitigate this performance issue by setting the appropriate bit fields in the MXCSR floating-point control register to flush denormals to zero (FTZ) or to replace any denormals loaded from memory with zero (DAZ). Check your compiler documentation to determine whether it has options to control FTZ and DAZ. Note that these compiler options may slightly affect accuracy.
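Before changing FTZ or DAZ behavior, it can help to check whether your input data contains denormals at all. The following is a minimal sketch in standard C99, not an Intel MKL facility: the helper names is_denormal and count_denormals are hypothetical, and the check relies only on fpclassify from math.h.

```c
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if x is a denormal (subnormal)
   double, that is, non-zero and smaller in magnitude than DBL_MIN. */
int is_denormal(double x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Hypothetical helper: counts the denormal entries in an array,
   for example in a vector about to be passed to a library routine. */
size_t count_denormals(const double *v, size_t n)
{
    size_t count = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (is_denormal(v[i])) {
            count++;
        }
    }
    return count;
}
```

For instance, DBL_MIN / 2.0 is classified as a denormal, whereas DBL_MIN, 0.0, and 1.0 are not. This check assumes the program itself is compiled without options that flush denormals to zero.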
FFT Optimized Radices

You can improve the performance of Intel MKL FFT if the length of your data vector permits factorization into powers of optimized radices. In Intel MKL, the optimized radices are 2, 3, 5, 7, 11, and 13.

Using Memory Management

Intel MKL Memory Management Software

Intel MKL has memory management software that controls memory buffers for use by the library functions. New buffers that the library allocates when your application calls Intel MKL are not deallocated until the program ends. To get the amount of memory allocated by the memory management software, call the mkl_mem_stat() function. If your program needs to free memory, call mkl_free_buffers(). If another call is made to a library function that needs a memory buffer, the memory manager again allocates the buffers, and they again remain allocated until either the program ends or the program deallocates the memory. This behavior facilitates better performance. However, some tools may report this behavior as a memory leak.

The memory management software is turned on by default. To turn it off, set the MKL_DISABLE_FAST_MM environment variable to any value or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL routines, especially for small problem sizes.

Redefining Memory Functions

In C/C++ programs, you can replace the memory functions that Intel MKL uses by default with your own functions. To do this, use the memory renaming feature.

Memory Renaming

Intel MKL memory management by default uses standard C run-time memory functions to allocate or free memory. These functions can be replaced using memory renaming.

Intel MKL accesses the memory functions by pointers i_malloc, i_free, i_calloc, and i_realloc, which are visible at the application level.
These pointers initially hold addresses of the standard C run-time memory functions malloc, free, calloc, and realloc, respectively. You can programmatically redefine values of these pointers to the addresses of your application's memory management functions.

Redirecting the pointers is the only correct way to use your own set of memory management functions. If you call your own memory functions without redirecting the pointers, the memory will get managed by two independent memory management packages, which may cause unexpected memory issues.

How to Redefine Memory Functions

To redefine memory functions, use the following procedure:

1. Include the i_malloc.h header file in your code. This header file contains all declarations required for replacing the memory allocation functions. The header file also describes how memory allocation can be replaced in those Intel libraries that support this feature.
2. Redefine values of pointers i_malloc, i_free, i_calloc, and i_realloc prior to the first call to MKL functions, as shown in the following example:

    #include "i_malloc.h"
    . . .
    i_malloc = my_malloc;
    i_calloc = my_calloc;
    i_realloc = my_realloc;
    i_free = my_free;
    . . .
    // Now you may call Intel MKL functions

Language-specific Usage Options

The Intel® Math Kernel Library (Intel® MKL) provides broad support for Fortran and C/C++ programming. However, not all functions support both Fortran and C interfaces. For example, some LAPACK functions have no C interface. You can call such functions from C using mixed-language programming.

If you want to use LAPACK or BLAS functions that support Fortran 77 in the Fortran 95 environment, additional effort may be initially required to build compiler-specific interface libraries and modules from the source code provided with Intel MKL.
Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Using Language-Specific Interfaces with Intel® Math Kernel Library

This section discusses mixed-language programming and the use of language-specific interfaces with Intel MKL. See also Appendix G in the Intel MKL Reference Manual for details of the FFTW interfaces to Intel MKL.

Interface Libraries and Modules

You can create the following interface libraries and modules using the respective makefiles located in the interfaces directory.

File name and contents:

Libraries, in Intel MKL architecture-specific directories

    libmkl_blas95.a (1)            Fortran 95 wrappers for BLAS (BLAS95) for IA-32 architecture.
    libmkl_blas95_ilp64.a (1)      Fortran 95 wrappers for BLAS (BLAS95) supporting ILP64 interface.
    libmkl_blas95_lp64.a (1)       Fortran 95 wrappers for BLAS (BLAS95) supporting LP64 interface.
    libmkl_lapack95.a (1)          Fortran 95 wrappers for LAPACK (LAPACK95) for IA-32 architecture.
    libmkl_lapack95_lp64.a (1)     Fortran 95 wrappers for LAPACK (LAPACK95) supporting LP64 interface.
    libmkl_lapack95_ilp64.a (1)    Fortran 95 wrappers for LAPACK (LAPACK95) supporting ILP64 interface.
    libfftw2xc_intel.a (1)         Interfaces for FFTW version 2.x (C interface for Intel compilers) to call Intel MKL FFTs.
    libfftw2xc_gnu.a               Interfaces for FFTW version 2.x (C interface for GNU compilers) to call Intel MKL FFTs.
    libfftw2xf_intel.a             Interfaces for FFTW version 2.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
    libfftw2xf_gnu.a               Interfaces for FFTW version 2.x (Fortran interface for GNU compilers) to call Intel MKL FFTs.
    libfftw3xc_intel.a (2)         Interfaces for FFTW version 3.x (C interface for Intel compilers) to call Intel MKL FFTs.
    libfftw3xc_gnu.a               Interfaces for FFTW version 3.x (C interface for GNU compilers) to call Intel MKL FFTs.
    libfftw3xf_intel.a (2)         Interfaces for FFTW version 3.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
    libfftw3xf_gnu.a               Interfaces for FFTW version 3.x (Fortran interface for GNU compilers) to call Intel MKL FFTs.
    libfftw2x_cdft_SINGLE.a        Single-precision interfaces for MPI FFTW version 2.x (C interface) to call Intel MKL cluster FFTs.
    libfftw2x_cdft_DOUBLE.a        Double-precision interfaces for MPI FFTW version 2.x (C interface) to call Intel MKL cluster FFTs.
    libfftw3x_cdft.a               Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFTs.
    libfftw3x_cdft_ilp64.a         Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFTs supporting the ILP64 interface.

Modules, in architecture- and interface-specific subdirectories of the Intel MKL include directory

    blas95.mod (1)                 Fortran 95 interface module for BLAS (BLAS95).
    lapack95.mod (1)               Fortran 95 interface module for LAPACK (LAPACK95).
    f95_precision.mod (1)          Fortran 95 definition of precision parameters for BLAS95 and LAPACK95.
    mkl95_blas.mod (1)             Fortran 95 interface module for BLAS (BLAS95), identical to blas95.mod. To be removed in one of the future releases.
    mkl95_lapack.mod (1)           Fortran 95 interface module for LAPACK (LAPACK95), identical to lapack95.mod. To be removed in one of the future releases.
    mkl95_precision.mod (1)        Fortran 95 definition of precision parameters for BLAS95 and LAPACK95, identical to f95_precision.mod. To be removed in one of the future releases.
    mkl_service.mod (1)            Fortran 95 interface module for Intel MKL support functions.

(1) Prebuilt for the Intel® Fortran compiler.
(2) FFTW3 interfaces are integrated with Intel MKL. Look into the /interfaces/fftw3x*/ makefile for options defining how to build and where to place the standalone library with the wrappers.

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 interfaces are compiler-dependent. Intel MKL provides the interface libraries and modules precompiled with the Intel® Fortran compiler. Additionally, the Fortran 95 interfaces and wrappers are delivered as sources. (For more information, see Compiler-dependent Functions and Fortran 90 Modules.) If you are using a different compiler, build the appropriate library and modules with your compiler and link the library as a user's library:

1. Go to the respective directory /interfaces/blas95 or /interfaces/lapack95
2. Type one of the following commands depending on your architecture:
   • For the IA-32 architecture, make libia32 INSTALL_DIR=
   • For the Intel® 64 architecture, make libintel64 [interface=lp64|ilp64] INSTALL_DIR=

Important: The parameter INSTALL_DIR is required.

As a result, the required library is built and installed in the /lib directory, and the .mod files are built and installed in the /include/[/{lp64|ilp64}] directory, where the architecture is one of {ia32, intel64}.

By default, the ifort compiler is assumed. You may change the compiler with the additional make parameter FC. For example, the command

    make libintel64 FC=pgf95 INSTALL_DIR= interface=lp64

builds the required library and .mod files and installs them in subdirectories of the directory given by INSTALL_DIR.
To delete the library from the building directory, use one of the following commands:

• For the IA-32 architecture, make cleania32 INSTALL_DIR=
• For the Intel® 64 architecture, make cleanintel64 [interface=lp64|ilp64] INSTALL_DIR=
• For all the architectures, make clean INSTALL_DIR=

CAUTION Even if you have administrative rights, avoid setting INSTALL_DIR=../.. or INSTALL_DIR= in a build or clean command above because these settings replace or delete the Intel MKL prebuilt Fortran 95 library and modules.

Compiler-dependent Functions and Fortran 90 Modules

Compiler-dependent functions occur whenever the compiler inserts into the object code function calls that are resolved in its run-time library (RTL). Linking such code without the appropriate RTL results in undefined symbols. Intel MKL has been designed to minimize RTL dependencies.

In cases where RTL dependencies might arise, the functions are delivered as source code, and you need to compile the code with whatever compiler you are using for your application.

In particular, Fortran 90 modules result in compiler-specific code generation requiring RTL support. Therefore, Intel MKL delivers these modules compiled with the Intel compiler, along with source code, to be used with different compilers.

Mixed-language Programming with the Intel Math Kernel Library

Appendix A: Intel(R) Math Kernel Library Language Interfaces Support lists the programming languages supported for each Intel MKL function domain. However, you can call Intel MKL routines from different language environments.

Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments

Not all Intel MKL function domains support both C and Fortran environments. To use Intel MKL Fortran-style functions in C/C++ environments, you should observe certain conventions, which are discussed for LAPACK and BLAS in the subsections below.

CAUTION Avoid calling BLAS 95/LAPACK 95 from C/C++.
Such calls require skills in manipulating the descriptor of a deferred-shape array, which is the Fortran 90 type. Moreover, BLAS95/LAPACK95 routines contain links to a Fortran RTL.

LAPACK and BLAS

Because LAPACK and BLAS routines are Fortran-style, when calling them from C-language programs, follow the Fortran-style calling conventions:

• Pass variables by address, not by value. Function calls in Example "Calling a Complex BLAS Level 1 Function from C++" and Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrate this.
• Store your data in Fortran style, that is, column-major rather than row-major order.

With row-major order, adopted in C, the last array index changes most quickly and the first one changes most slowly when traversing the memory segment where the array is stored. With Fortran-style column-major order, the last index changes most slowly whereas the first index changes most quickly (as illustrated by the figure below for a two-dimensional array).

For example, if a two-dimensional matrix A of size m x n is stored densely in a one-dimensional array B, you can access a matrix element like this:

    A[i][j] = B[i*n+j]        in C        ( i=0, ... , m-1, j=0, ... , n-1 )
    A(i,j)  = B((j-1)*m+i)    in Fortran  ( i=1, ... , m, j=1, ... , n )

When calling LAPACK or BLAS routines from C, be aware that because the Fortran language is case-insensitive, the routine names can be either upper-case or lower-case, with or without the trailing underscore. For example, the following names are equivalent:

• LAPACK: dgetrf, DGETRF, dgetrf_, and DGETRF_
• BLAS: dgemm, DGEMM, dgemm_, and DGEMM_

See Example "Calling a Complex BLAS Level 1 Function from C++" on how to call BLAS routines from C.

See also the Intel(R) MKL Reference Manual for a description of the C interface to LAPACK functions.

CBLAS

Instead of calling BLAS routines from a C-language program, you can use the CBLAS interface.
CBLAS is a C-style interface to the BLAS routines. You can call CBLAS routines using regular C-style calls. Use the mkl.h header file with the CBLAS interface. The header file specifies enumerated values and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrates the use of the CBLAS interface.

C Interface to LAPACK

Instead of calling LAPACK routines from a C-language program, you can use the C interface to LAPACK provided by Intel MKL.

The C interface to LAPACK is a C-style interface to the LAPACK routines. This interface supports matrices in row-major and column-major order, which you can define in the first function argument matrix_order. Use the mkl_lapacke.h header file with the C interface to LAPACK. The header file specifies constants and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. You can find examples of the C interface to LAPACK in the examples/lapacke subdirectory in the Intel MKL installation directory.

Using Complex Types in C/C++

As described in the documentation for the Intel® Fortran Compiler XE, C/C++ does not directly implement the Fortran types COMPLEX(4) and COMPLEX(8). However, you can write equivalent structures. The type COMPLEX(4) consists of two 4-byte floating-point numbers. The first of them is the real-number component, and the second one is the imaginary-number component. The type COMPLEX(8) is similar to COMPLEX(4) except that it contains two 8-byte floating-point numbers.

Intel MKL provides complex types MKL_Complex8 and MKL_Complex16, which are structures equivalent to the Fortran complex types COMPLEX(4) and COMPLEX(8), respectively.
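The layout requirement described above (a pair of reals, real part first, with no padding) can be checked for any candidate structure. The following is a minimal sketch: the types demo_complex8 and demo_complex16 and the two checking functions are hypothetical stand-ins written for illustration and are not the definitions from the Intel MKL headers.

```c
#include <stddef.h>

/* Hypothetical stand-in for a COMPLEX(4)-compatible structure:
   two 4-byte reals, real part first. */
typedef struct { float real; float imag; } demo_complex8;

/* Hypothetical stand-in for a COMPLEX(8)-compatible structure:
   two 8-byte reals, real part first. */
typedef struct { double real; double imag; } demo_complex16;

/* Returns 1 if demo_complex8 is exactly two consecutive 4-byte
   reals with the real part at offset 0. */
int complex8_layout_ok(void)
{
    return sizeof(float) == 4
        && sizeof(demo_complex8) == 8
        && offsetof(demo_complex8, real) == 0
        && offsetof(demo_complex8, imag) == 4;
}

/* Returns 1 if demo_complex16 is exactly two consecutive 8-byte
   reals with the real part at offset 0. */
int complex16_layout_ok(void)
{
    return sizeof(double) == 8
        && sizeof(demo_complex16) == 16
        && offsetof(demo_complex16, real) == 0
        && offsetof(demo_complex16, imag) == 8;
}
```

The same sizeof and offsetof checks can be applied to your own complex type before you pass arrays of it to Fortran-style routines.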
The MKL_Complex8 and MKL_Complex16 types are defined in the mkl_types.h header file. You can use these types to define complex data. You can also redefine the types with your own types before including the mkl_types.h header file. The only requirement is that the types must be compatible with the Fortran complex layout, that is, the complex type must be a pair of real numbers for the values of the real and imaginary parts.

For example, you can use the following definitions in your C++ code:

    #define MKL_Complex8 std::complex<float>

and

    #define MKL_Complex16 std::complex<double>

See Example "Calling a Complex BLAS Level 1 Function from C++" for details. You can also define these types in the command line:

    -DMKL_Complex8="std::complex<float>"
    -DMKL_Complex16="std::complex<double>"

See Also
Intel® Software Documentation Library

Calling BLAS Functions that Return the Complex Values in C/C++ Code

Complex values that functions return are handled differently in C and Fortran. Because BLAS is Fortran-style, you need to be careful when handling a call from C to a BLAS function that returns complex values. However, in addition to normal function calls, Fortran enables calling functions as though they were subroutines, which provides a mechanism for returning the complex value correctly when the function is called from a C program. When a Fortran function is called as a subroutine, the return value is the first parameter in the calling sequence. You can use this feature to call a BLAS function from C.
The following example shows how a call to a Fortran function as a subroutine converts to a call from C and the hidden parameter result gets exposed:

Normal Fortran function call:

    result = cdotc( n, x, 1, y, 1 )

A call to the function as a subroutine:

    call cdotc( result, n, x, 1, y, 1)

A call to the function from C:

    cdotc( &result, &n, x, &one, y, &one )

NOTE Intel MKL has both upper-case and lower-case entry points in the Fortran-style (case-insensitive) BLAS, with or without the trailing underscore. So, all these names are equivalent and acceptable: cdotc, CDOTC, cdotc_, and CDOTC_.

The above example shows one of the ways to call several level 1 BLAS functions that return complex values from your C and C++ applications. An easier way is to use the CBLAS interface. For instance, you can call the same function using the CBLAS interface as follows:

    cblas_cdotc_sub( n, x, 1, y, 1, &result )

NOTE The complex value comes last on the argument list in this case.

The following examples show use of the Fortran-style BLAS interface from C and C++, as well as the CBLAS (C language) interface:

• Example "Calling a Complex BLAS Level 1 Function from C"
• Example "Calling a Complex BLAS Level 1 Function from C++"
• Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

Example "Calling a Complex BLAS Level 1 Function from C"

The example below illustrates a call from a C program to the complex BLAS Level 1 function zdotc(). This function computes the dot product of two double-precision complex vectors.

In this example, the complex dot product is returned in the structure c.
    #include <stdio.h>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n = N, inca = 1, incb = 1, i;
        MKL_Complex16 a[N], b[N], c;
        for( i = 0; i < n; i++ ){
            a[i].real = (double)i; a[i].imag = (double)i * 2.0;
            b[i].real = (double)(n - i); b[i].imag = (double)i * 2.0;
        }
        zdotc( &c, &n, a, &inca, b, &incb );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.real, c.imag );
        return 0;
    }

Example "Calling a Complex BLAS Level 1 Function from C++"

Below is the C++ implementation:

    #include <complex>
    #include <iostream>
    #define MKL_Complex16 std::complex<double>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        std::complex<double> a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i] = std::complex<double>(i,i*2.0);
            b[i] = std::complex<double>(n-i,i*2.0);
        }
        zdotc(&c, &n, a, &inca, b, &incb );
        std::cout << "The complex dot product is: " << c << std::endl;
        return 0;
    }

Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

This example uses CBLAS:

    #include <stdio.h>
    #include "mkl.h"
    typedef struct{ double re; double im; } complex16;
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        complex16 a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i].re = (double)i; a[i].im = (double)i * 2.0;
            b[i].re = (double)(n - i); b[i].im = (double)i * 2.0;
        }
        cblas_zdotc_sub(n, a, inca, b, incb, &c );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im );
        return 0;
    }

Support for Boost uBLAS Matrix-matrix Multiplication

If you are used to uBLAS, you can perform BLAS matrix-matrix multiplication in C++ using Intel MKL substitution of Boost uBLAS functions. uBLAS is the Boost C++ open-source library that provides BLAS functionality for dense, packed, and sparse matrices. The library uses an expression template technique for passing expressions as function arguments, which enables evaluating vector and matrix expressions in one pass without temporary matrices. uBLAS provides two modes:

• Debug (safe) mode, default. Checks types and conformance.
• Release (fast) mode. Does not check types and conformance. To enable this mode, use the NDEBUG preprocessor symbol.

The documentation for the Boost uBLAS is available at www.boost.org.

Intel MKL provides overloaded prod() functions for substituting uBLAS dense matrix-matrix multiplication with the Intel MKL gemm calls. Though these functions break uBLAS expression templates and introduce temporary matrices, the performance advantage can be considerable for matrix sizes that are not too small (roughly, over 50).

You do not need to change your source code to use the functions. To call them:

• Include the header file mkl_boost_ublas_matrix_prod.hpp in your code (from the Intel MKL include directory)
• Add appropriate Intel MKL libraries to the link line.

The list of expressions that are substituted follows:

    prod( m1, m2 )
    prod( trans(m1), m2 )
    prod( trans(conj(m1)), m2 )
    prod( conj(trans(m1)), m2 )
    prod( m1, trans(m2) )
    prod( trans(m1), trans(m2) )
    prod( trans(conj(m1)), trans(m2) )
    prod( conj(trans(m1)), trans(m2) )
    prod( m1, trans(conj(m2)) )
    prod( trans(m1), trans(conj(m2)) )
    prod( trans(conj(m1)), trans(conj(m2)) )
    prod( conj(trans(m1)), trans(conj(m2)) )
    prod( m1, conj(trans(m2)) )
    prod( trans(m1), conj(trans(m2)) )
    prod( trans(conj(m1)), conj(trans(m2)) )
    prod( conj(trans(m1)), conj(trans(m2)) )

These expressions are substituted in the release mode only (with the NDEBUG preprocessor symbol defined). Supported uBLAS versions are Boost 1.34.1 and higher. To get them, visit www.boost.org.

A code example provided in the /examples/ublas/source/sylvester.cpp file illustrates usage of the Intel MKL uBLAS header file for solving a special case of the Sylvester equation.
To run the Intel MKL ublas examples, specify the BOOST_ROOT parameter in the make command, for instance, when using Boost version 1.37.0:

    make libia32 BOOST_ROOT=/boost_1_37_0

See Also
Using Code Examples

Invoking Intel MKL Functions from Java* Applications

Intel MKL Java* Examples

To demonstrate binding with Java, Intel MKL includes a set of Java examples in the following directory: /examples/java.

The examples are provided for the following MKL functions:

• ?gemm, ?gemv, and ?dot families from CBLAS
• The complete set of non-cluster FFT functions
• ESSL(1)-like functions for one-dimensional convolution and correlation
• VSL Random Number Generators (RNG), except user-defined ones and file subroutines
• VML functions, except GetErrorCallBack, SetErrorCallBack, and ClearErrorCallBack

You can see the example sources in the following directory: /examples/java/examples.

The examples are written in Java. They demonstrate usage of the MKL functions with the following variety of data:

• 1- and 2-dimensional data sequences
• Real and complex types of the data
• Single and double precision

However, the wrappers used in the examples do not:

• Demonstrate the use of large arrays (>2 billion elements)
• Demonstrate processing of arrays in native memory
• Check correctness of function parameters
• Demonstrate performance optimizations

The examples use the Java Native Interface (JNI* developer framework) to bind with Intel MKL. The JNI documentation is available from http://java.sun.com/javase/6/docs/technotes/guides/jni/.

The Java example set includes JNI wrappers that perform the binding. The wrappers do not depend on the examples and may be used in your Java applications. The wrappers for CBLAS, FFT, VML, VSL RNG, and ESSL-like convolution and correlation functions do not depend on each other.

To build the wrappers, just run the examples. The makefile builds the wrapper binaries.
After running the makefile, you can run the examples, which will determine whether the wrappers were built correctly. As a result of running the examples, the following directories will be created in /examples/java:

• docs
• include
• classes
• bin
• _results

The directories docs, include, classes, and bin will contain the wrapper binaries and documentation; the directory _results will contain the testing results.

For a Java programmer, the wrappers are the following Java classes:

• com.intel.mkl.CBLAS
• com.intel.mkl.DFTI
• com.intel.mkl.ESSL
• com.intel.mkl.VML
• com.intel.mkl.VSL

Documentation for the particular wrapper and example classes will be generated from the Java sources while building and running the examples. To browse the documentation, open the index file in the docs directory (created by the build script): /examples/java/docs/index.html.

The Java wrappers for CBLAS, VML, VSL RNG, and FFT establish the interface that directly corresponds to the underlying native functions, so you can refer to the Intel MKL Reference Manual for their functionality and parameters. Interfaces for the ESSL-like functions are described in the generated documentation for the com.intel.mkl.ESSL class.

Each wrapper consists of the interface part for Java and a JNI stub written in C. You can find the sources in the following directory: /examples/java/wrappers.

Both the Java and C parts of the wrappers for CBLAS and VML demonstrate the straightforward approach, which you may use to cover additional CBLAS functions.

The wrapper for FFT is more complicated because it needs to support the lifecycle for FFT descriptor objects. To compute a single Fourier transform, an application needs to call the FFT software several times with the same copy of the native FFT descriptor. The wrapper provides the handler class to hold the native descriptor, while the virtual machine runs Java bytecode.

The wrapper for VSL RNG is similar to the one for FFT.
The wrapper provides the handler class to hold the native descriptor of the stream state.

The wrapper for the convolution and correlation functions mitigates the same difficulty of the VSL interface, which assumes a similar lifecycle for "task descriptors". The wrapper uses the ESSL-like interface for those functions, which is simpler for the case of 1-dimensional data. The JNI stub additionally encapsulates the MKL functions into the ESSL-like wrappers written in C and so "packs" the lifecycle of a task descriptor into a single call to the native method.

The wrappers meet the JNI Specification versions 1.1 and 5.0 and should work with virtually every modern implementation of Java.

The examples and the Java part of the wrappers are written for the Java language described in "The Java Language Specification (First Edition)" and extended with the feature of "inner classes" (that is, the language as it stood in the late 1990s). This language level is supported by all versions of the Sun Java Development Kit* (JDK*) developer toolkit and compatible implementations starting from version 1.1.5, and by all modern versions of Java.

The level of the C language is "Standard C" (that is, C89) with additional assumptions about integer and floating-point data types required by the Intel MKL interfaces and the JNI header files. That is, the native float and double data types must be the same as the JNI jfloat and jdouble data types, respectively, and the native int must be 4 bytes long.

[1] IBM Engineering Scientific Subroutine Library (ESSL*).

See Also
Running the Java* Examples

Running the Java* Examples

The Java examples support all the C and C++ compilers that Intel MKL does. The makefile intended to run the examples also needs the make utility, which is typically provided with the Linux* OS distribution.

To run the Java examples, the JDK* developer toolkit is required for compiling and running Java code. A Java implementation must be installed on the computer or available via the network.
You may download the JDK from the vendor website.

The examples should work for all versions of JDK. However, they were tested only with the following Java implementations for all the supported architectures:
• J2SE* SDK 1.4.2, JDK 5.0 and 6.0 from Sun Microsystems, Inc. (http://sun.com/).
• JRockit* JDK 1.4.2 and 5.0 from Oracle Corporation (http://oracle.com/).

Note that the Java run-time environment* (JRE*) system, which may be pre-installed on your computer, is not enough. You need the JDK* developer toolkit that supports the following set of tools:
• java
• javac
• javah
• javadoc

To make these tools available for the examples makefile, set the JAVA_HOME environment variable and add the JDK binaries directory to the system PATH, for example, using the bash shell:

    export JAVA_HOME=/home//jdk1.5.0_09
    export PATH=${JAVA_HOME}/bin:${PATH}

You may also need to clear the JDK_HOME environment variable, if it is assigned a value:

    unset JDK_HOME

To start the examples, use the makefile found in the Intel MKL Java examples directory:

    make {soia32|sointel64|libia32|libintel64} [function=...] [compiler=...]

If you type the make command and omit the target (for example, soia32), the makefile prints the help info, which explains the targets and parameters.

For the examples list, see the examples.lst file in the Java examples directory.

Known Limitations of the Java* Examples

This section explains limitations of the Java examples.

Functionality

Some Intel MKL functions may fail to work if called from the Java environment by using a wrapper, like those provided with the Intel MKL Java examples. Only those specific CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions listed in the Intel MKL Java Examples section were tested with the Java environment. So, you may use the Java wrappers for these CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions in your Java applications.
Performance

The Intel MKL functions are expected to run faster than similar functions written in pure Java. However, the main goal of these wrappers is to provide code examples, not maximum performance. So, an Intel MKL function called from a Java application will probably run slower than the same function called from a program written in C/C++ or Fortran.

Known bugs

There are a number of known bugs in Intel MKL (identified in the Release Notes), as well as incompatibilities between different versions of JDK. The examples and wrappers include workarounds for these problems. Look at the source code in the examples and wrappers for comments that describe the workarounds.

Chapter 7: Coding Tips

This section discusses programming with the Intel® Math Kernel Library (Intel® MKL) to provide coding tips that meet specific needs, such as consistent results of computations or conditional compilation.

Aligning Data for Consistent Results

Routines in Intel MKL may return different results from run to run on the same system. This is usually due to a change in the order in which floating-point operations are performed. The two most influential factors are array alignment and parallelism. Array alignment can determine how internal loops order floating-point operations. Non-deterministic parallelism may change the order in which computational tasks are executed. While these results may differ, they should still fall within acceptable computational error bounds. To better assure identical results from run to run, do the following:
• Align input arrays on 16-byte boundaries
• Run Intel MKL in the sequential mode

To align input arrays on 16-byte boundaries, use mkl_malloc() in place of system-provided memory allocators, as shown in the example titled "Aligning Addresses on 16-byte Boundaries". Sequential mode of Intel MKL removes the influence of non-deterministic parallelism.
Aligning Addresses on 16-byte Boundaries

    // ******* C language *******
    ...
    #include "mkl.h"   // declares mkl_malloc() and mkl_free()
    ...
    void *darray;
    int workspace;
    ...
    // Allocate workspace aligned on 16-byte boundary
    darray = mkl_malloc( sizeof(double)*workspace, 16 );
    ...
    // call the program using MKL
    mkl_app( darray );
    ...
    // Free workspace
    mkl_free( darray );

    ! ******* Fortran language *******
    ...
    double precision darray
    pointer (p_wrk,darray(1))
    integer workspace
    ...
    ! Allocate workspace aligned on 16-byte boundary
    p_wrk = mkl_malloc( 8*workspace, 16 )
    ...
    ! call the program using MKL
    call mkl_app( darray )
    ...
    ! Free workspace
    call mkl_free(p_wrk)

Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Preprocessor symbols (macros) substitute values in a program before it is compiled. The substitution is performed in the preprocessing phase.

The following preprocessor symbols are available:

    Predefined Preprocessor Symbol    Description
    __INTEL_MKL__                     Intel MKL major version
    __INTEL_MKL_MINOR__               Intel MKL minor version
    __INTEL_MKL_UPDATE__              Intel MKL update number
    INTEL_MKL_VERSION                 Intel MKL full version in the following format:
                                      INTEL_MKL_VERSION = (__INTEL_MKL__*100+__INTEL_MKL_MINOR__)*100+__INTEL_MKL_UPDATE__

These symbols enable conditional compilation of code that uses new features introduced in a particular version of the library.

To perform conditional compilation:
1. Include in your code the file where the macros are defined:
   • mkl.h for C/C++
   • mkl.fi for Fortran
2. [Optionally] Use the following preprocessor directives to check whether the macro is defined:
   • #ifdef, #endif for C/C++
   • !DEC$IF DEFINED, !DEC$ENDIF for Fortran
3. Use preprocessor directives for conditional inclusion of code:
   • #if, #endif for C/C++
   • !DEC$IF, !DEC$ENDIF for Fortran

Example

Compile a part of the code if the Intel MKL version is MKL 10.3 update 4:

C/C++:

    #include "mkl.h"
    #ifdef INTEL_MKL_VERSION
    #if INTEL_MKL_VERSION == 100304
    // Code to be conditionally compiled
    #endif
    #endif

Fortran:

    include "mkl.fi"
    !DEC$IF DEFINED INTEL_MKL_VERSION
    !DEC$IF INTEL_MKL_VERSION .EQ. 100304
    * Code to be conditionally compiled
    !DEC$ENDIF
    !DEC$ENDIF

Chapter 8: Working with the Intel® Math Kernel Library Cluster Software

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Linking with ScaLAPACK and Cluster FFTs

The Intel MKL ScaLAPACK and Cluster FFTs support MPI implementations identified in the Intel MKL Release Notes.

To link a program that calls ScaLAPACK or Cluster FFTs, you need to know how to link a message-passing interface (MPI) application first.

Use the MPI compiler scripts to do this. For example, mpicc and mpif77 are the C and FORTRAN 77 scripts, respectively, that use the correct MPI header files. The location of these scripts and the MPI library depends on your MPI implementation.
For example, for the default installation of MPICH, /opt/mpich/bin/mpicc and /opt/mpich/bin/mpif77 are the compiler scripts and /opt/mpich/lib/libmpich.a is the MPI library.

Check the documentation that comes with your MPI implementation for implementation-specific details of linking.

To link with Intel MKL ScaLAPACK and/or Cluster FFTs, use the following general form:

    < linker script> \
    -L [-Wl,--start-group] \
    [-Wl,--end-group]

where the placeholders stand for paths and libraries as explained in the following table:

One of ScaLAPACK or Cluster FFT libraries for the appropriate architecture and programming interface (LP64 or ILP64). Available libraries are listed in Directory Structure in Detail. For example, for the IA-32 architecture, it is either -lmkl_scalapack_core or -lmkl_cdft_core.

The BLACS library corresponding to your architecture, programming interface (LP64 or ILP64), and MPI version. Available BLACS libraries are listed in Directory Structure in Detail. For example, for the IA-32 architecture, choose one of -lmkl_blacs, -lmkl_blacs_intelmpi, or -lmkl_blacs_openmpi, depending on the MPI version you use; specifically, for Intel MPI 3.x, choose -lmkl_blacs_intelmpi.

for ScaLAPACK, and for Cluster FFTs.

Processor optimized kernels, threading library, and system library for threading support, linked as described in Listing Libraries on a Link Line.

The LAPACK library and .

One of several MPI implementations (MPICH, Intel MPI, and so on).

< linker script> A linker script that corresponds to the MPI version. For instance, for Intel MPI 3.x, use .
For example, if you are using Intel MPI 3.x, want to statically use the LP64 interface with ScaLAPACK, and have only one MPI process per core (and thus do not use threading), specify the following linker options:

    -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group
    $MKLPATH/libmkl_scalapack_lp64.a $MKLPATH/libmkl_blacs_intelmpi_lp64.a
    $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a
    $MKLPATH/libmkl_core.a -static_mpi -Wl,--end-group -lpthread -lm

NOTE Grouping symbols -Wl,--start-group and -Wl,--end-group are required for static linking.

TIP Use the Link-line Advisor to quickly choose the appropriate set of , , and .

See Also
Linking Your Application with the Intel® Math Kernel Library
Examples for Linking with ScaLAPACK and Cluster FFT

Setting the Number of Threads

The OpenMP* software responds to the environment variable OMP_NUM_THREADS. Intel MKL also has other mechanisms to set the number of threads, such as the MKL_NUM_THREADS or MKL_DOMAIN_NUM_THREADS environment variables (see Using Additional Threading Control).

Make sure that the relevant environment variables have the same and correct values on all the nodes. Intel MKL versions 10.0 and higher no longer set the default number of threads to one, but depend on the OpenMP libraries used with the compiler to set the default number. For the threading layer based on the Intel compiler (libmkl_intel_thread.a), this value is the number of CPUs according to the OS.

CAUTION Avoid over-prescribing the number of threads, which may occur, for instance, when the number of MPI ranks per node and the number of threads per node are both greater than one. The product of MPI ranks per node and the number of threads per node should not exceed the number of physical cores per node.

The best place to set an environment variable, such as OMP_NUM_THREADS, is your login environment.
Changing OMP_NUM_THREADS on the head node and then starting your run, as you would on a shared-memory (SMP) system, does not change the variable on all the nodes, because mpirun starts a fresh default shell on every node. To change the number of threads on all the nodes, add a line at the top of .bashrc, as follows:

    OMP_NUM_THREADS=1; export OMP_NUM_THREADS

You can run multiple CPUs per node using MPICH. To do this, build MPICH to enable multiple CPUs per node. Be aware that certain MPICH applications may fail to work perfectly in a threaded environment (see the Known Limitations section in the Release Notes). If you encounter problems with MPICH when the number of threads is set greater than one, first try setting the number of threads to one and see whether the problem persists.

See Also
Techniques to Set the Number of Threads

Using Shared Libraries

All needed shared libraries must be visible on all the nodes at run time. To achieve this, list the directories containing these libraries in the LD_LIBRARY_PATH environment variable in the .bashrc file.

If Intel MKL is installed only on one node, link statically when building your Intel MKL applications rather than use shared libraries.

The Intel compilers or GNU compilers can be used to compile a program that uses Intel MKL. However, make sure that the MPI implementation and compiler match up correctly.

Building ScaLAPACK Tests

To build ScaLAPACK tests,
• For the IA-32 architecture, add libmkl_scalapack_core.a to your link command.
• For the Intel® 64 architecture, add libmkl_scalapack_lp64.a or libmkl_scalapack_ilp64.a, depending on the desired interface.

Examples for Linking with ScaLAPACK and Cluster FFT

This section provides examples of linking with ScaLAPACK and Cluster FFT.

Note that a binary linked with ScaLAPACK runs the same way as any other MPI application (refer to the documentation that comes with your MPI implementation).
For instance, the script mpirun is used in the case of MPICH2 and OpenMPI, and the number of MPI processes is set by the -np option. In the case of MPICH 2.0 and all Intel MPIs, start the daemon before running your application; the execution is driven by the script mpiexec.

For further linking examples, see the support website for Intel products at http://www.intel.com/software/products/support/.

See Also
Directory Structure in Detail

Examples for Linking a C Application

These examples illustrate linking of an application whose main module is in C under the following conditions:
• MPICH2 1.0.7 or higher is installed in /opt/mpich.
• $MKLPATH is a user-defined variable containing /lib/ia32.
• You use the Intel® C++ Compiler 10.0 or higher.

To link with ScaLAPACK for a cluster of systems based on the IA-32 architecture, use the following link line:

    /opt/mpich/bin/mpicc \
        -L$MKLPATH \
        -lmkl_scalapack_core \
        -lmkl_blacs_intelmpi \
        -lmkl_intel -lmkl_intel_thread -lmkl_core \
        -liomp5 -lpthread

To link with Cluster FFT for a cluster of systems based on the IA-32 architecture, use the following link line:

    /opt/mpich/bin/mpicc \
        -Wl,--start-group \
        $MKLPATH/libmkl_cdft_core.a \
        $MKLPATH/libmkl_blacs_intelmpi.a \
        $MKLPATH/libmkl_intel.a \
        $MKLPATH/libmkl_intel_thread.a \
        $MKLPATH/libmkl_core.a \
        -Wl,--end-group \
        -liomp5 -lpthread

See Also
Linking with ScaLAPACK and Cluster FFTs

Examples for Linking a Fortran Application

These examples illustrate linking of an application whose main module is in Fortran under the following conditions:
• Intel MPI 3.0 is installed in /opt/intel/mpi/3.0.
• $MKLPATH is a user-defined variable containing /lib/intel64.
• You use the Intel® Fortran Compiler 10.0 or higher.
To link with ScaLAPACK for a cluster of systems based on the Intel® 64 architecture, use the following link line:

    /opt/intel/mpi/3.0/bin/mpiifort \
        -L$MKLPATH \
        -lmkl_scalapack_lp64 \
        -lmkl_blacs_intelmpi_lp64 \
        -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
        -liomp5 -lpthread

To link with Cluster FFT for a cluster of systems based on the Intel® 64 architecture, use the following link line:

    /opt/intel/mpi/3.0/bin/mpiifort \
        -Wl,--start-group \
        $MKLPATH/libmkl_cdft_core.a \
        $MKLPATH/libmkl_blacs_intelmpi_ilp64.a \
        $MKLPATH/libmkl_intel_ilp64.a \
        $MKLPATH/libmkl_intel_thread.a \
        $MKLPATH/libmkl_core.a \
        -Wl,--end-group \
        -liomp5 -lpthread

See Also
Linking with ScaLAPACK and Cluster FFTs

Chapter 9: Programming with Intel® Math Kernel Library in the Eclipse* Integrated Development Environment (IDE)

Configuring the Eclipse* IDE CDT to Link with Intel MKL

This section explains how to configure the Eclipse* Integrated Development Environment (IDE) C/C++ Development Tools (CDT) to link with Intel® Math Kernel Library (Intel® MKL).

TIP After configuring your CDT, you can benefit from the Eclipse-provided code assist feature. See Code/Context Assist description in the CDT Help for details.

To configure your Eclipse IDE CDT to link with Intel MKL, you need to perform the steps explained below. The specific instructions for performing these steps depend on your version of the CDT and on the tool-chain/compiler integration. Refer to the CDT Help for more details.

To configure your Eclipse IDE CDT, do the following:
1. Open Project Properties for your project.
2. Add the Intel MKL include path, that is, /include, to the project's include paths.
3. Add the Intel MKL library path for the target architecture to the project's library paths. For example, for the Intel® 64 architecture, add /lib/intel64.
4. Specify the names of the Intel MKL libraries to link with your application.
   For example, you may need the following libraries: mkl_intel_lp64, mkl_intel_thread, mkl_core, and iomp5.

NOTE Because compilers typically require library names rather than file names, omit the "lib" prefix and the "a" or "so" extension.

See Also
Selecting Libraries to Link with
Linking in Detail

Getting Assistance for Programming in the Eclipse* IDE

Intel MKL provides an Eclipse* IDE plug-in (com.intel.mkl.help) that contains the Intel MKL Reference Manual (see High-level Directory Structure for the plug-in location after the library installation). To install the plug-in, do one of the following:
• Use the Eclipse IDE Update Manager (recommended). To invoke the Manager, use the Help > Software Updates command in your Eclipse IDE.
• Copy the plug-in to the plugins folder of your Eclipse IDE directory. In this case, if you use earlier C/C++ Development Tools (CDT) versions (3.x, 4.x), delete or rename the index subfolder in the eclipse/configuration/org.eclipse.help.base folder of your Eclipse IDE to avoid delays in Index updating.

The following Intel MKL features assist you while programming in the Eclipse* IDE:
• The Intel MKL Reference Manual viewable from within the IDE
• Eclipse Help search tuned to target the Intel Web sites
• Code/Content Assist in the Eclipse IDE CDT

The Intel MKL plug-in for Eclipse IDE provides the first two features. The last feature is native to the Eclipse IDE CDT. See the Code Assist description in Eclipse IDE Help for details.

Viewing the Intel® Math Kernel Library Reference Manual in the Eclipse* IDE

To view the Reference Manual, in Eclipse,
1. Select Help > Help Contents from the menu.
2. In the Help tab, under All Topics, click Intel® Math Kernel Library Help.
3. In the Help tree that expands, click Intel Math Kernel Library Reference Manual.

The Intel MKL Help Index is also available in Eclipse, and the Reference Manual is included in the Eclipse Help search.
Searching the Intel Web Site from the Eclipse* IDE

The Intel MKL plug-in tunes Eclipse Help search to target http://www.intel.com so that when you are connected to the Internet and run a search from the Eclipse Help pane, the search hits at the site are shown through a separate link. The following figure shows search results for "VML Functions" in Eclipse Help. In the figure, 1 hit means an entry hit to the respective site.

Click "Intel.com (1 hit)" to open the list of actual hits to the Intel Web site.

Chapter 10: LINPACK and MP LINPACK Benchmarks

Intel® Optimized LINPACK Benchmark for Linux* OS

Intel® Optimized LINPACK Benchmark is a generalization of the LINPACK 1000 benchmark. It solves a dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy. The generalization is in the number of equations (N) it can solve, which is not limited to 1000.
The benchmark uses partial pivoting to assure the accuracy of the results.

Do not use this benchmark to report LINPACK 100 performance because that is a compiled-code only benchmark. This is a shared-memory (SMP) implementation which runs on a single platform. Do not confuse this benchmark with:
• MP LINPACK, which is a distributed memory version of the same benchmark.
• LINPACK, the library, which has been expanded upon by the LAPACK library.

Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your genuine Intel processor systems more easily than with the High Performance Linpack (HPL) benchmark. Use this package to benchmark your SMP machine.

Additional information on this software as well as other Intel software performance products is available at http://www.intel.com/software/products/.

Contents of the Intel® Optimized LINPACK Benchmark

The Intel Optimized LINPACK Benchmark for Linux* OS contains the following files, located in the ./benchmarks/linpack/ subdirectory of the Intel® Math Kernel Library (Intel® MKL) directory:

File in ./benchmarks/linpack/    Description

xlinpack_xeon32
    The 32-bit program executable for a system based on Intel® Xeon® processor or Intel® Xeon® processor MP with or without Streaming SIMD Extensions 3 (SSE3).
xlinpack_xeon64
    The 64-bit program executable for a system with Intel® Xeon® processor using Intel® 64 architecture.
runme_xeon32
    A sample shell script for executing a pre-determined problem set for xlinpack_xeon32. OMP_NUM_THREADS set to 2 processors.
runme_xeon64
    A sample shell script for executing a pre-determined problem set for xlinpack_xeon64. OMP_NUM_THREADS set to 4 processors.
lininput_xeon32
    Input file for pre-determined problem for the runme_xeon32 script.
lininput_xeon64
    Input file for pre-determined problem for the runme_xeon64 script.
lin_xeon32.txt
    Result of the runme_xeon32 script execution.
lin_xeon64.txt
    Result of the runme_xeon64 script execution.
help.lpk
    Simple help file.
xhelp.lpk
    Extended help file.

See Also
High-level Directory Structure

Running the Software

To obtain results for the pre-determined sample problem sizes on a given system, type one of the following, as appropriate:

    ./runme_xeon32
    ./runme_xeon64

To run the software for other problem sizes, see the extended help included with the program. Extended help can be viewed by running the program executable with the -e option:

    ./xlinpack_xeon32 -e
    ./xlinpack_xeon64 -e

The pre-defined data input files lininput_xeon32 and lininput_xeon64 are provided merely as examples. Different systems have different numbers of processors or amounts of memory and thus require new input files. The extended help can be used for insight into proper ways to change the sample input files.

Each input file requires at least the following amount of memory:

    lininput_xeon32    2 GB
    lininput_xeon64    16 GB

If the system has less memory than the above sample data input requires, you may need to edit or create your own data input files, as explained in the extended help.

Each sample script uses the OMP_NUM_THREADS environment variable to set the number of processors it is targeting. To optimize performance on a different number of physical processors, change that line appropriately. If you run the Intel Optimized LINPACK Benchmark without setting the number of threads, it will default to the number of cores according to the OS. You can find the settings for this environment variable in the runme_* sample scripts. If the settings do not yet match the situation for your machine, edit the script.

Known Limitations of the Intel® Optimized LINPACK Benchmark

The following limitations are known for the Intel Optimized LINPACK Benchmark for Linux* OS:
• Intel Optimized LINPACK Benchmark is threaded to effectively use multiple processors. So, in multi-processor systems, best performance will be obtained with the Intel® Hyper-Threading Technology turned off, which ensures that the operating system assigns threads to physical processors only.
• If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.

Intel® Optimized MP LINPACK Benchmark for Clusters

Overview of the Intel® Optimized MP LINPACK Benchmark for Clusters

The Intel® Optimized MP LINPACK Benchmark for Clusters is based on modifications and additions to HPL 2.0 from Innovative Computing Laboratories (ICL) at the University of Tennessee, Knoxville (UTK). The Intel Optimized MP LINPACK Benchmark for Clusters can be used for Top 500 runs (see http://www.top500.org). To use the benchmark you need to be intimately familiar with the HPL distribution and usage. The Intel Optimized MP LINPACK Benchmark for Clusters provides some additional enhancements and bug fixes designed to make the HPL usage more convenient, as well as explanations of Intel® Message-Passing Interface (MPI) settings that may enhance performance.
The ./benchmarks/mp_linpack directory adds techniques to minimize search times frequently associated with long runs.

The Intel® Optimized MP LINPACK Benchmark for Clusters is an implementation of the Massively Parallel MP LINPACK benchmark by means of HPL code. It solves a random dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy. You can solve any size (N) system of equations that fits into memory. The benchmark uses full row pivoting to ensure the accuracy of the results.

Use the Intel Optimized MP LINPACK Benchmark for Clusters on a distributed memory machine. On a shared memory machine, use the Intel Optimized LINPACK Benchmark.

Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your systems based on genuine Intel processors more easily than with the HPL benchmark. Use the Intel Optimized MP LINPACK Benchmark to benchmark your cluster. The prebuilt binaries require that Intel® MPI 3.x be installed on the cluster first. The run-time version of Intel MPI is free and can be downloaded from www.intel.com/software/products/.

The Intel package includes software developed at the University of Tennessee, Knoxville, Innovative Computing Laboratories; neither the University nor ICL endorses or promotes this product. Although HPL 2.0 is redistributable under certain conditions, this particular package is subject to the Intel MKL license.

Intel MKL has introduced a new functionality into MP LINPACK, which is called a hybrid build, while continuing to support the older version. The term hybrid refers to special optimizations added to take advantage of mixed OpenMP*/MPI parallelism. If you want to use one MPI process per node and to achieve further parallelism by means of OpenMP, use the hybrid build.
In general, the hybrid build is useful when the number of MPI processes per core is less than one. If you want to rely exclusively on MPI for parallelism and use one MPI per core, use the non-hybrid build. In addition to supplying certain hybrid prebuilt binaries, Intel MKL supplies some hybrid prebuilt libraries for Intel® MPI to take advantage of the additional OpenMP* optimizations. If you wish to use an MPI version other than Intel MPI, you can do so by using the MP LINPACK source provided. You can use the source to build a non-hybrid version that may be used in a hybrid mode, but it would be missing some of the optimizations added to the hybrid version. Non-hybrid builds are the default of the source code makefiles provided. In some cases, the use of the hybrid mode is required for external reasons. If there is a choice, the non-hybrid code may be faster. To use the non-hybrid code in a hybrid mode, use the threaded version of Intel MKL BLAS, link with a thread-safe MPI, and call function MPI_init_thread() so as to indicate a need for MPI to be thread-safe. LINPACK and MP LINPACK Benchmarks 10 79Intel MKL also provides prebuilt binaries that are dynamically linked against Intel MPI libraries. NOTE Performance of statically and dynamically linked prebuilt binaries may be different. The performance of both depends on the version of Intel MPI you are using. You can build binaries statically linked against a particular version of Intel MPI by yourself. Optimization Notice Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. 
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Contents of the Intel® Optimized MP LINPACK Benchmark for Clusters

The Intel Optimized MP LINPACK Benchmark for Clusters (MP LINPACK Benchmark) includes the HPL 2.0 distribution in its entirety, as well as the modifications delivered in the files listed in the table below and located in the ./benchmarks/mp_linpack/ subdirectory of the Intel MKL directory.

Directory/File in ./benchmarks/mp_linpack/    Contents
testing/ptest/HPL_pdtest.c    HPL 2.0 code modified to display captured DGEMM information in ASYOUGO2_DISPLAY if it was captured (for details, see New Features).
src/blas/HPL_dgemm.c    HPL 2.0 code modified to capture DGEMM information, if desired, from ASYOUGO2_DISPLAY.
src/grid/HPL_grid_init.c    HPL 2.0 code modified to do additional grid experiments originally not in HPL 2.0.
src/pgesv/HPL_pdgesvK2.c    HPL 2.0 code modified to do ASYOUGO and ENDEARLY modifications.
src/pgesv/HPL_pdgesv0.c    HPL 2.0 code modified to do ASYOUGO, ASYOUGO2, and ENDEARLY modifications.
testing/ptest/HPL.dat    HPL 2.0 sample HPL.dat modified.
Make.ia32    (New) Sample architecture makefile for processors using the IA-32 architecture and Linux OS.
Make.intel64    (New) Sample architecture makefile for processors using the Intel® 64 architecture and Linux OS.
HPL.dat    A repeat of testing/ptest/HPL.dat in the top-level directory.
Prebuilt executables readily available for simple performance testing.
bin_intel/ia32/xhpl_ia32    (New) Prebuilt binary for the IA-32 architecture and Linux OS. Statically linked against Intel® MPI 3.2.
bin_intel/ia32/xhpl_ia32_dynamic    (New) Prebuilt binary for the IA-32 architecture and Linux OS. Dynamically linked against Intel® MPI 3.2.
bin_intel/intel64/xhpl_intel64    (New) Prebuilt binary for the Intel® 64 architecture and Linux OS. Statically linked against Intel® MPI 3.2.
bin_intel/intel64/xhpl_intel64_dynamic    (New) Prebuilt binary for the Intel® 64 architecture and Linux OS. Dynamically linked against Intel® MPI 3.2.
Prebuilt hybrid executables
bin_intel/ia32/xhpl_hybrid_ia32    (New) Prebuilt hybrid binary for the IA-32 architecture and Linux OS. Statically linked against Intel® MPI 3.2.
bin_intel/ia32/xhpl_hybrid_ia32_dynamic    (New) Prebuilt hybrid binary for the IA-32 architecture and Linux OS. Dynamically linked against Intel® MPI 3.2.
bin_intel/intel64/xhpl_hybrid_intel64    (New) Prebuilt hybrid binary for the Intel® 64 architecture and Linux OS. Statically linked against Intel® MPI 3.2.
bin_intel/intel64/xhpl_hybrid_intel64_dynamic    (New) Prebuilt hybrid binary for the Intel® 64 architecture and Linux OS. Dynamically linked against Intel® MPI 3.2.
Prebuilt libraries
lib_hybrid/ia32/libhpl_hybrid.a    (New) Prebuilt library with the hybrid version of MP LINPACK for the IA-32 architecture and Intel MPI 3.2.
lib_hybrid/intel64/libhpl_hybrid.a    (New) Prebuilt library with the hybrid version of MP LINPACK for the Intel® 64 architecture and Intel MPI 3.2.
Files that refer to run scripts
bin_intel/ia32/runme_ia32    (New) Sample run script for the IA-32 architecture and a pure MPI binary statically linked against Intel MPI 3.2.
bin_intel/ia32/runme_ia32_dynamic    (New) Sample run script for the IA-32 architecture and a pure MPI binary dynamically linked against Intel MPI 3.2.
bin_intel/ia32/HPL_serial.dat    (New) Example of an MP LINPACK benchmark input file for a pure MPI binary and the IA-32 architecture.
bin_intel/ia32/runme_hybrid_ia32    (New) Sample run script for the IA-32 architecture and a hybrid binary statically linked against Intel MPI 3.2.
bin_intel/ia32/runme_hybrid_ia32_dynamic    (New) Sample run script for the IA-32 architecture and a hybrid binary dynamically linked against Intel MPI 3.2.
bin_intel/ia32/HPL_hybrid.dat    (New) Example of an MP LINPACK benchmark input file for a hybrid binary and the IA-32 architecture.
bin_intel/intel64/runme_intel64    (New) Sample run script for the Intel® 64 architecture and a pure MPI binary statically linked against Intel MPI 3.2.
bin_intel/intel64/runme_intel64_dynamic    (New) Sample run script for the Intel® 64 architecture and a pure MPI binary dynamically linked against Intel MPI 3.2.
bin_intel/intel64/HPL_serial.dat    (New) Example of an MP LINPACK benchmark input file for a pure MPI binary and the Intel® 64 architecture.
bin_intel/intel64/runme_hybrid_intel64    (New) Sample run script for the Intel® 64 architecture and a hybrid binary statically linked against Intel MPI 3.2.
bin_intel/intel64/runme_hybrid_intel64_dynamic    (New) Sample run script for the Intel® 64 architecture and a hybrid binary dynamically linked against Intel MPI 3.2.
bin_intel/intel64/HPL_hybrid.dat    (New) Example of an MP LINPACK benchmark input file for a hybrid binary and the Intel® 64 architecture.
nodeperf.c    (New) Sample utility that tests the DGEMM speed across the cluster.

See Also
High-level Directory Structure

Building the MP LINPACK

The MP LINPACK Benchmark contains a few sample architecture makefiles. You can edit them to fit your specific configuration. Specifically:
• Set TOPdir to the directory that MP LINPACK is being built in.
• You may set MPI variables, that is, MPdir, MPinc, and MPlib.
• Specify the location of Intel MKL and of the files to be used (LAdir, LAinc, LAlib).
• Adjust compiler and compiler/linker options.
• Specify the version of MP LINPACK you are going to build (hybrid or non-hybrid) by setting the version parameter for the make command.
For example:

    make arch=intel64 version=hybrid install

For some sample cases, like Linux systems based on the Intel® 64 architecture, the makefiles contain values that must be common. However, you need to be familiar with building an HPL and picking appropriate values for these variables.

New Features of Intel® Optimized MP LINPACK Benchmark

The toolset is basically identical with the HPL 2.0 distribution. There are a few changes that are optionally compiled in and disabled until you specifically request them. These new features are:

ASYOUGO: Provides non-intrusive performance information while runs proceed. There are only a few outputs and this information does not impact performance. This is especially useful because many runs can go for hours without any information.
ASYOUGO2: Provides slightly intrusive additional performance information by intercepting every DGEMM call.
ASYOUGO2_DISPLAY: Displays the performance of all the significant DGEMMs inside the run.
ENDEARLY: Displays a few performance hints and then terminates the run early.
FASTSWAP: Inserts the LAPACK-optimized DLASWP into HPL's code. You can experiment with this to determine best results.
HYBRID: Establishes the Hybrid OpenMP/MPI mode of MP LINPACK, providing the possibility to use threaded Intel MKL and prebuilt MP LINPACK hybrid libraries.

CAUTION Use this option only with an Intel compiler and the Intel® MPI library version 3.1 or higher. It is also recommended that you use the compiler version 10.0 or higher.

Benchmarking a Cluster

To benchmark a cluster, follow the sequence of steps below (some of them are optional). Pay special attention to the iterative steps 3 and 4. They make a loop that searches for HPL parameters (specified in HPL.dat) that enable you to reach the top performance of your cluster.

1. Install HPL and make sure HPL is functional on all the nodes.
2. You may run nodeperf.c (included in the distribution) to see the performance of DGEMM on all the nodes. Compile nodeperf.c with your MPI and Intel MKL. For example:

    mpiicc -O3 nodeperf.c -L$MKLPATH $MKLPATH/libmkl_intel_lp64.a \
    -Wl,--start-group $MKLPATH/libmkl_sequential.a \
    $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread

Launching nodeperf.c on all the nodes is especially helpful in a very large cluster. nodeperf enables quick identification of the potential problem spot without numerous small MP LINPACK runs around the cluster in search of the bad node. It goes through all the nodes, one at a time, and reports the performance of DGEMM followed by some host identifier. Therefore, the higher the DGEMM performance, the faster that node was performing.
3. Edit HPL.dat to fit your cluster needs. Read through the HPL documentation for ideas on this. Note, however, that you should use at least 4 nodes.
4. Make an HPL run, using compile options such as ASYOUGO, ASYOUGO2, or ENDEARLY to aid in your search. These options enable you to gain insight into the performance sooner than HPL would normally give this insight. When doing so, follow these recommendations:
• Use MP LINPACK, which is a patched version of HPL, to save time in the search. All performance intrusive features are compile-optional in MP LINPACK. That is, if you do not use the new options to reduce search time, these features are disabled. The primary purpose of the additions is to assist you in finding solutions. HPL requires a long time to search for many different parameters. In MP LINPACK, the goal is to get the best possible number. Given that the input is not fixed, there is a large parameter space you must search over. An exhaustive search of all possible inputs is impractically large even for a powerful cluster. MP LINPACK optionally prints information on performance as it proceeds. You can also terminate early.
• Save time by compiling with -DENDEARLY -DASYOUGO2 and using a negative threshold (do not use a negative threshold on the final run that you intend to submit as a Top500 entry). Set the threshold in line 13 of the HPL 2.0 input file HPL.dat.
• If you are going to run a problem to completion, do it with -DASYOUGO.
5. Using the quick performance feedback, return to step 3 and iterate until you are sure that the performance is as good as possible.

See Also
Options to Reduce Search Time

Options to Reduce Search Time

Running large problems to completion on large numbers of nodes can take many hours. The search space for MP LINPACK is also large: not only can you run a problem of any size, but you can also vary block sizes, grid layouts, lookahead steps, factorization methods, and so on. It can be a large waste of time to run a large problem to completion only to discover it ran 0.01% slower than your previous best problem.

Use the following options to reduce the search time:
• -DASYOUGO
• -DENDEARLY
• -DASYOUGO2

Use -DASYOUGO2 cautiously because it does have a marginal performance impact. To see DGEMM internal performance, compile with -DASYOUGO2 and -DASYOUGO2_DISPLAY. These options provide a lot of useful DGEMM performance information at the cost of around 0.2% performance loss.

If you want to use the old HPL, simply omit these options and recompile from scratch. To do this, try "make arch=<arch> clean_arch_all".

-DASYOUGO

-DASYOUGO gives performance data as the run proceeds. The performance always starts off higher and then drops because this actually happens in LU decomposition (a decomposition of a matrix into a product of lower (L) and upper (U) triangular matrices). The ASYOUGO performance estimate is usually an overestimate (because the LU decomposition slows down as it goes), but it gets more accurate as the problem proceeds. The greater the lookahead step, the less accurate the first number may be.
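To see why the early part of a run carries so much weight in any running estimate, it can help to look at how the LU work is distributed over the columns. The sketch below is illustrative only: it assumes the standard leading-order flop count of 2N³/3 for LU factorization and ignores lower-order terms, and it is not taken from the ASYOUGO source. The function names and the problem size are hypothetical.

```python
def lu_flops(n):
    """Leading-order flop count for LU factorization of an n x n matrix."""
    return 2.0 * n**3 / 3.0


def work_fraction_done(n, cols_done):
    """Approximate fraction of LU flops finished after cols_done columns.

    Assumes the remaining work is the LU factorization of the
    (n - cols_done) x (n - cols_done) trailing matrix.
    """
    if not 0 <= cols_done <= n:
        raise ValueError("cols_done must be between 0 and n")
    return 1.0 - lu_flops(n - cols_done) / lu_flops(n)


n = 100000  # hypothetical problem size, chosen only for illustration
for col_fraction in (0.05, 0.10, 0.25, 0.50):
    cols = int(n * col_fraction)
    share = work_fraction_done(n, cols)
    print(f"{col_fraction:.2f} of columns -> {share:.3f} of flops")
```

Under this model, half of the columns account for about 87.5% of the flops, so the large, efficient updates happen early and the small, less efficient ones come at the end.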
ASYOUGO tries to estimate where one is in the LU decomposition that MP LINPACK performs and this is always an overestimate as compared to ASYOUGO2, which measures actually achieved DGEMM performance. Note that the ASYOUGO output is a subset of the information that ASYOUGO2 provides. So, refer to the description of the -DASYOUGO2 option below for the details of the output.

-DENDEARLY

-DENDEARLY terminates the problem after a few steps, so that you can set up 10 or 20 HPL runs without monitoring them, see how they all do, and then only run the fastest ones to completion. -DENDEARLY assumes -DASYOUGO. You do not need to define both, although it doesn't hurt. To avoid the residual check for a problem that terminates early, set the "threshold" parameter in HPL.dat to a negative number when testing ENDEARLY. It also sometimes gives a better picture to compile with -DASYOUGO2 when using -DENDEARLY.

Usage notes on -DENDEARLY follow:
• -DENDEARLY stops the problem after a few iterations of DGEMM on the block size (the bigger the block size, the further it gets). It prints only 5 or 6 "updates", whereas -DASYOUGO prints about 46 or so output elements before the problem completes.
• Performance for -DASYOUGO and -DENDEARLY always starts off at one speed, slowly increases, and then slows down toward the end (because that is what LU does). -DENDEARLY is likely to terminate before it starts to slow down.
• -DENDEARLY terminates the problem early with an HPL Error exit. It means that you need to ignore the missing residual results, which are wrong because the problem never completed. However, you can get an idea what the initial performance was, and if it looks good, then run the problem to completion without -DENDEARLY. To avoid the error check, you can set HPL's threshold parameter in HPL.dat to a negative number.
• Though -DENDEARLY terminates early, HPL treats the problem as completed and computes Gflop rating as though the problem ran to completion. Ignore this erroneously high rating.
• The bigger the problem, the closer the last update that -DENDEARLY returns is to what happens when the problem runs to completion. -DENDEARLY is a poor approximation for small problems. For this reason, it is suggested that you use ENDEARLY in conjunction with ASYOUGO2, because ASYOUGO2 reports actual DGEMM performance, which can be a closer approximation to problems just starting.

-DASYOUGO2

-DASYOUGO2 gives detailed single-node DGEMM performance information. It captures all DGEMM calls (if you use Fortran BLAS) and records their data. Because of this, the routine has a marginal intrusive overhead. Unlike -DASYOUGO, which is quite non-intrusive, -DASYOUGO2 interrupts every DGEMM call to monitor its performance. Be aware of this overhead, although for big problems it is less than 0.1%.

Here is a sample ASYOUGO2 output (the first 3 non-intrusive numbers can be found in ASYOUGO and ENDEARLY, so it suffices to describe these numbers here):

Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78).

The problem size was N=16000 with a block size of 128. After 10 blocks, that is, 1280 columns, an output was sent to the screen. Here, the fraction of columns completed is 1280/16000=0.08.
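When many short runs are compared, it may be convenient to pull the fields out of such a line programmatically. The following is a minimal sketch, assuming the output keeps the KEY=value layout of the sample line above; the helper name parse_asyougo2 is hypothetical and is not part of HPL or Intel MKL. It also recomputes the number of completed blocks and the fraction of columns from N=16000 and the block size of 128 quoted in the text.

```python
import re

# Sample line copied from the text above (trailing period omitted).
line = "Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78)"


def parse_asyougo2(text):
    """Return the KEY=value fields of one output line as a dict of floats."""
    pairs = re.findall(r"(\w+)=([0-9.]+)", text)
    return {key: float(value) for key, value in pairs}


fields = parse_asyougo2(line)

n = 16000   # problem size from the text
nb = 128    # block size from the text

blocks_done = int(fields["Col"]) // nb       # completed blocks
columns_fraction = fields["Col"] / n         # fraction of columns completed

print(fields)
print(blocks_done, columns_fraction)
```

Keeping the reported Fract value next to the recomputed fraction makes it easy to see how far the printed checkpoint lags behind the actual position for a given N and block size.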
Only up to 40 outputs are printed, at various places through the matrix decomposition: fractions 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 0.210 0.215 0.220 0.225 0.230 0.235 0.240 0.245 0.250 0.255 0.260 0.265 0.270 0.275 0.280 0.285 0.290 0.295 0.300 0.305 0.310 0.315 0.320 0.325 0.330 0.335 0.340 0.345 0.350 0.355 0.360 0.365 0.370 0.375 0.380 0.385 0.390 0.395 0.400 0.405 0.410 0.415 0.420 0.425 0.430 0.435 0.440 0.445 0.450 0.455 0.460 0.465 0.470 0.475 0.480 0.485 0.490 0.495 0.515 0.535 0.555 0.575 0.595 0.615 0.635 0.655 0.675 0.695 0.795 0.895.

However, this problem size is so small and the block size so big by comparison that as soon as it prints the value for 0.045, it was already through 0.08 fraction of the columns. On a really big problem, the fractional number will be more accurate. It never prints more than the 112 numbers above. So, smaller problems will have fewer than 112 updates, and the biggest problems will have precisely 112 updates.

Mflops is an estimate based on 1280 columns of LU being completed. However, with lookahead steps, sometimes that work is not actually completed when the output is made. Nevertheless, this is a good estimate for comparing identical runs.

The 3 numbers in parentheses are intrusive ASYOUGO2 add-ins. DT is the total time processor 0 has spent in DGEMM. DF is the number of billion operations that have been performed in DGEMM by one processor. Hence, the performance of processor 0 (in Gflops) in DGEMM is always DF/DT.
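As a worked instance of the DF/DT relation, the short sketch below plugs in the values from the sample line (DT=9.5 seconds, DF=34.1 billion operations). The function name dgemm_gflops is hypothetical and used only for illustration.

```python
def dgemm_gflops(df_billion_ops, dt_seconds):
    """DGEMM rate of processor 0 in Gflops, computed as DF/DT."""
    if dt_seconds <= 0:
        raise ValueError("DT must be positive")
    return df_billion_ops / dt_seconds


# Values taken from the sample ASYOUGO2 line in the text.
rate = dgemm_gflops(df_billion_ops=34.1, dt_seconds=9.5)
print(f"{rate:.2f} Gflops")  # prints "3.59 Gflops"
```

Because DF and DT both refer to a single processor, this rate is a per-processor figure and is not directly comparable with the run-wide Mflops value on the same line.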
Using the number of DGEMM flops as a basis instead of the number of LU flops, you get a lower bound on performance of the run by looking at DMF, which can be compared to Mflops above. (It uses the global LU time, but the DGEMM flops are computed under the assumption that the problem is evenly distributed amongst the nodes, as only HPL's node (0,0) returns any output.)

Note that when using the above performance monitoring tools to compare different HPL.dat input data sets, you should be aware that the pattern of performance drop-off that LU experiences is sensitive to some input data. For instance, when you try very small problems, the performance drop-off from the initial values to end values is very rapid. The larger the problem, the less the drop-off, and it is probably safe to use the first few performance values to estimate the difference between a problem size 700000 and 701000, for instance. Another factor that influences the performance drop-off is the grid dimensions (P and Q). For big problems, the performance tends to fall off less from the first few steps when P and Q are roughly equal in value. You can make use of a large number of parameters, such as broadcast types, and change them so that the final performance is determined very closely by the first few steps.

Using these tools will greatly increase the amount of data you can test.

See Also
Benchmarking a Cluster

Intel® Math Kernel Library Language Interfaces Support A

Language Interfaces Support, by Function Domain

The following table shows language interfaces that Intel® Math Kernel Library (Intel® MKL) provides for each function domain. However, Intel MKL routines can be called from other languages using mixed-language programming. See Mixed-language Programming with Intel® MKL for an example of how to call Fortran routines from C/C++.
Function Domain    FORTRAN 77 interface    Fortran 90/95 interface    C/C++ interface
Basic Linear Algebra Subprograms (BLAS)    Yes    Yes    via CBLAS
BLAS-like extension transposition routines    Yes    Yes
Sparse BLAS Level 1    Yes    Yes    via CBLAS
Sparse BLAS Level 2 and 3    Yes    Yes    Yes
LAPACK routines for solving systems of linear equations    Yes    Yes    Yes
LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations    Yes    Yes    Yes
Auxiliary and utility LAPACK routines    Yes    Yes
Parallel Basic Linear Algebra Subprograms (PBLAS)    Yes
ScaLAPACK routines    Yes    †
DSS/PARDISO* solvers    Yes    Yes    Yes
Other Direct and Iterative Sparse Solver routines    Yes    Yes    Yes
Vector Mathematical Library (VML) functions    Yes    Yes    Yes
Vector Statistical Library (VSL) functions    Yes    Yes    Yes
Fourier Transform functions (FFT)    Yes    Yes
Cluster FFT functions    Yes    Yes
Trigonometric Transform routines    Yes    Yes
Fast Poisson, Laplace, and Helmholtz Solver (Poisson Library) routines    Yes    Yes
Optimization (Trust-Region) Solver routines    Yes    Yes    Yes
Data Fitting functions    Yes    Yes    Yes
GMP* arithmetic functions ††    Yes
Support functions (including memory allocation)    Yes    Yes    Yes

† Supported using a mixed language programming call. See Intel® MKL Include Files for the respective header file.
†† GMP Arithmetic Functions are deprecated and will be removed in a future release.
Include Files

Function domain    Fortran Include Files    C/C++ Include Files
All function domains    mkl.fi    mkl.h
BLAS Routines    blas.f90, mkl_blas.fi    mkl_blas.h
BLAS-like Extension Transposition Routines    mkl_trans.fi    mkl_trans.h
CBLAS Interface to BLAS    -    mkl_cblas.h
Sparse BLAS Routines    mkl_spblas.fi    mkl_spblas.h
LAPACK Routines    lapack.f90, mkl_lapack.fi    mkl_lapack.h
C Interface to LAPACK    -    mkl_lapacke.h
ScaLAPACK Routines    -    mkl_scalapack.h
All Sparse Solver Routines    mkl_solver.f90    mkl_solver.h
PARDISO    mkl_pardiso.f77, mkl_pardiso.f90    mkl_pardiso.h
DSS Interface    mkl_dss.f77, mkl_dss.f90    mkl_dss.h
RCI Iterative Solvers, ILU Factorization    mkl_rci.fi    mkl_rci.h
Optimization Solver Routines    mkl_rci.fi    mkl_rci.h
Vector Mathematical Functions    mkl_vml.f77, mkl_vml.f90    mkl_vml.h
Vector Statistical Functions    mkl_vsl.f77, mkl_vsl.f90    mkl_vsl_functions.h
Fourier Transform Functions    mkl_dfti.f90    mkl_dfti.h
Cluster Fourier Transform Functions    mkl_cdft.f90    mkl_cdft.h
Partial Differential Equations Support Routines
Trigonometric Transforms    mkl_trig_transforms.f90    mkl_trig_transforms.h
Poisson Solvers    mkl_poisson.f90    mkl_poisson.h
Data Fitting functions    mkl_df.f77, mkl_df.f90    mkl_df.h
GMP interface †    -    mkl_gmp.h
Support functions    mkl_service.f90, mkl_service.fi    mkl_service.h
Memory allocation routines    -    i_malloc.h
Intel MKL examples interface    -    mkl_example.h

† GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
Language Interfaces Support, by Function Domain

Support for Third-Party Interfaces B

GMP* Functions

Intel® Math Kernel Library (Intel® MKL) implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers.
The interfaces of such functions fully match the GNU Multiple Precision* (GMP) Arithmetic Library. For specifications of these functions, please see http://software.intel.com/sites/products/documentation/hpc/mkl/gnump/index.htm.

NOTE Intel MKL GMP Arithmetic Functions are deprecated and will be removed in a future release.

If you currently use the GMP* library, you need to change the include statements in your programs to reference mkl_gmp.h.

FFTW Interface Support

Intel® Math Kernel Library (Intel® MKL) offers two collections of wrappers for the FFTW interface (www.fftw.org). The wrappers are the superstructure of FFTW to be used for calling the Intel MKL Fourier transform functions. These collections correspond to the FFTW versions 2.x and 3.x and the Intel MKL versions 7.0 and later.

These wrappers enable using Intel MKL Fourier transforms to improve the performance of programs that use FFTW without changing the program source code. See the "FFTW Interface to Intel® Math Kernel Library" appendix in the Intel MKL Reference Manual for details on the use of the wrappers.

Important For ease of use, FFTW3 interface is also integrated in Intel MKL.

Directory Structure in Detail C

Tables in this section show contents of the Intel(R) Math Kernel Library (Intel(R) MKL) architecture-specific directories.

Detailed Structure of the IA-32 Architecture Directories

Static Libraries in the lib/ia32 Directory

File    Contents
Interface layer
libmkl_intel.a    Interface library for the Intel compilers
libmkl_blas95.a    Fortran 95 interface library for BLAS for the Intel® Fortran compiler
libmkl_lapack95.a    Fortran 95 interface library for LAPACK for the Intel Fortran compiler
libmkl_gf.a    Interface library for the GNU* Fortran compiler
Threading layer
libmkl_intel_thread.a    Threading library for the Intel compilers
libmkl_gnu_thread.a    Threading library for the GNU Fortran and C compilers
libmkl_pgi_thread.a    Threading library for the PGI* compiler
libmkl_sequential.a    Sequential library
Computational layer
libmkl_core.a    Kernel library for the IA-32 architecture
libmkl_solver.a    Deprecated. Empty library for backward compatibility
libmkl_solver_sequential.a    Deprecated. Empty library for backward compatibility
libmkl_scalapack_core.a    ScaLAPACK routines
libmkl_cdft_core.a    Cluster version of FFT functions
Run-time Libraries (RTL)
libmkl_blacs.a    BLACS routines supporting the following MPICH versions: Myricom* MPICH version 1.2.5.10; ANL* MPICH version 1.2.5.2
libmkl_blacs_intelmpi.a    BLACS routines supporting Intel MPI and MPICH2
libmkl_blacs_intelmpi20.a    A soft link to lib/32/libmkl_blacs_intelmpi.a
libmkl_blacs_openmpi.a    BLACS routines supporting OpenMPI

Dynamic Libraries in the lib/ia32 Directory

File    Contents
libmkl_rt.so    Single Dynamic Library
Interface layer
libmkl_intel.so    Interface library for the Intel compilers
libmkl_gf.so    Interface library for the GNU Fortran compiler
Threading layer
libmkl_intel_thread.so    Threading library for the Intel compilers
libmkl_gnu_thread.so    Threading library for the GNU Fortran and C compilers
libmkl_pgi_thread.so    Threading library for the PGI* compiler
libmkl_sequential.so    Sequential library
Computational layer
libmkl_core.so    Library dispatcher for dynamic load of processor-specific kernel library
libmkl_def.so    Default kernel library (Intel® Pentium®, Pentium® Pro, Pentium® II, and Pentium® III processors)
libmkl_p4.so    Pentium® 4 processor kernel library
libmkl_p4p.so    Kernel library for the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3), including Intel® Core™ Duo and Intel® Core™ Solo processors.
libmkl_p4m.so    Kernel library for processors based on the Intel® Core™ microarchitecture (except Intel® Core™ Duo and Intel® Core™ Solo processors, for which mkl_p4p.so is intended)
libmkl_p4m3.so    Kernel library for the Intel® Core™ i7 processors
libmkl_vml_def.so    VML/VSL part of default kernel for old Intel® Pentium® processors
libmkl_vml_ia.so    VML/VSL default kernel for newer Intel® architecture processors
libmkl_vml_p4.so    VML/VSL part of Pentium® 4 processor kernel
libmkl_vml_p4m.so    VML/VSL for processors based on the Intel® Core™ microarchitecture
libmkl_vml_p4m2.so    VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families
libmkl_vml_p4m3.so    VML/VSL for the Intel® Core™ i7 processors
libmkl_vml_p4p.so    VML/VSL for Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3)
libmkl_vml_avx.so    VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX)
libmkl_scalapack_core.so    ScaLAPACK routines.
libmkl_cdft_core.so    Cluster version of FFT functions.
Run-time Libraries (RTL)
libmkl_blacs_intelmpi.so    BLACS routines supporting Intel MPI and MPICH2
locale/en_US/mkl_msg.cat    Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English
locale/ja_JP/mkl_msg.cat    Catalog of Intel MKL messages in Japanese. Available only if the Intel® MKL package provides Japanese localization. Please see the Release Notes for this information

Detailed Structure of the Intel® 64 Architecture Directories

Static Libraries in the lib/intel64 Directory

File    Contents
Interface layer
libmkl_intel_lp64.a    LP64 interface library for the Intel compilers
libmkl_intel_ilp64.a    ILP64 interface library for the Intel compilers
libmkl_intel_sp2dp.a    SP2DP interface library for the Intel compilers
libmkl_blas95_lp64.a    Fortran 95 interface library for BLAS for the Intel® Fortran compiler. Supports the LP64 interface
libmkl_blas95_ilp64.a    Fortran 95 interface library for BLAS for the Intel® Fortran compiler. Supports the ILP64 interface
libmkl_lapack95_lp64.a    Fortran 95 interface library for LAPACK for the Intel® Fortran compiler. Supports the LP64 interface
libmkl_lapack95_ilp64.a    Fortran 95 interface library for LAPACK for the Intel® Fortran compiler. Supports the ILP64 interface
libmkl_gf_lp64.a    LP64 interface library for the GNU Fortran compilers
libmkl_gf_ilp64.a    ILP64 interface library for the GNU Fortran compilers
Threading layer
libmkl_intel_thread.a    Threading library for the Intel compilers
libmkl_gnu_thread.a    Threading library for the GNU Fortran and C compilers
libmkl_pgi_thread.a    Threading library for the PGI compiler
libmkl_sequential.a    Sequential library
Computational layer
libmkl_core.a    Kernel library for the Intel® 64 architecture
libmkl_solver_lp64.a    Deprecated. Empty library for backward compatibility
libmkl_solver_lp64_sequential.a    Deprecated. Empty library for backward compatibility
libmkl_solver_ilp64.a    Deprecated. Empty library for backward compatibility
libmkl_solver_ilp64_sequential.a    Deprecated. Empty library for backward compatibility
libmkl_scalapack_lp64.a    ScaLAPACK routine library supporting the LP64 interface
libmkl_scalapack_ilp64.a    ScaLAPACK routine library supporting the ILP64 interface
libmkl_cdft_core.a    Cluster version of FFT functions.
Run-time Libraries (RTL)
libmkl_blacs_lp64.a    LP64 version of BLACS routines supporting the following MPICH versions: Myricom* MPICH version 1.2.5.10; ANL* MPICH version 1.2.5.2
libmkl_blacs_ilp64.a    ILP64 version of BLACS routines supporting the following MPICH versions: Myricom* MPICH version 1.2.5.10; ANL* MPICH version 1.2.5.2
libmkl_blacs_intelmpi_lp64.a    LP64 version of BLACS routines supporting Intel MPI and MPICH2
libmkl_blacs_intelmpi_ilp64.a    ILP64 version of BLACS routines supporting Intel MPI and MPICH2
libmkl_blacs_intelmpi20_lp64.a    A soft link to lib/intel64/libmkl_blacs_intelmpi_lp64.a
libmkl_blacs_intelmpi20_ilp64.a    A soft link to lib/intel64/libmkl_blacs_intelmpi_ilp64.a
libmkl_blacs_openmpi_lp64.a    LP64 version of BLACS routines supporting OpenMPI.
libmkl_blacs_openmpi_ilp64.a    ILP64 version of BLACS routines supporting OpenMPI.
libmkl_blacs_sgimpt_lp64.a    LP64 version of BLACS routines supporting SGI MPT.
libmkl_blacs_sgimpt_ilp64.a    ILP64 version of BLACS routines supporting SGI MPT.

Dynamic Libraries in the lib/intel64 Directory

File    Contents
libmkl_rt.so    Single Dynamic Library
Interface layer
libmkl_intel_lp64.so    LP64 interface library for the Intel compilers
libmkl_intel_ilp64.so    ILP64 interface library for the Intel compilers
libmkl_intel_sp2dp.so    SP2DP interface library for the Intel compilers
libmkl_gf_lp64.so    LP64 interface library for the GNU Fortran compilers
libmkl_gf_ilp64.so    ILP64 interface library for the GNU Fortran compilers
Threading layer
libmkl_intel_thread.so    Threading library for the Intel compilers
libmkl_gnu_thread.so    Threading library for the GNU Fortran and C compilers
libmkl_pgi_thread.so    Threading library for the PGI* compiler
libmkl_sequential.so    Sequential library
Computational layer
libmkl_core.so    Library dispatcher for dynamic load of processor-specific kernel
libmkl_def.so    Default kernel library
libmkl_mc.so    Kernel library for processors based on the Intel® Core™ microarchitecture
libmkl_mc3.so    Kernel library for the Intel® Core™ i7 processors
libmkl_avx.so    Kernel optimized for the Intel® Advanced Vector Extensions (Intel® AVX).
libmkl_vml_def.so    VML/VSL part of default kernels
libmkl_vml_p4n.so    VML/VSL for the Intel® Xeon® processor using the Intel® 64 architecture
libmkl_vml_mc.so    VML/VSL for processors based on the Intel® Core™ microarchitecture
libmkl_vml_mc2.so    VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families
libmkl_vml_mc3.so    VML/VSL for the Intel® Core™ i7 processors
libmkl_vml_avx.so    VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX)
libmkl_scalapack_lp64.so    ScaLAPACK routine library supporting the LP64 interface
libmkl_scalapack_ilp64.so    ScaLAPACK routine library supporting the ILP64 interface
libmkl_cdft_core.so    Cluster version of FFT functions.
Run-time Libraries (RTL)

File                          Contents
libmkl_intelmpi_lp64.so       LP64 version of BLACS routines supporting Intel MPI and MPICH2
libmkl_intelmpi_ilp64.so      ILP64 version of BLACS routines supporting Intel MPI and MPICH2
locale/en_US/mkl_msg.cat      Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English
locale/ja_JP/mkl_msg.cat      Catalog of Intel MKL messages in Japanese. Available only if the Intel® MKL package provides Japanese localization. Please see the Release Notes for this information.
Intel® Math Kernel Library for Mac OS* X User's Guide

Intel® MKL - Mac OS* X
Document Number: 315932-018US

Contents

Legal Information
Introducing the Intel® Math Kernel Library
Getting Help and Support
Notational Conventions

Chapter 1: Overview
  Document Overview
  What's New
  Related Information

Chapter 2: Getting Started
  Checking Your Installation
  Setting Environment Variables
  Compiler Support
  Using Code Examples
  What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Chapter 3: Structure of the Intel® Math Kernel Library
  Architecture Support
  High-level Directory Structure
  Layered Model Concept
  Accessing the Intel® Math Kernel Library Documentation
    Contents of the Documentation Directories
    Viewing Man Pages

Chapter 4: Linking Your Application with the Intel® Math Kernel Library
  Linking Quick Start
    Using the -mkl Compiler Option
    Using the Single Dynamic Library
    Selecting Libraries to Link with
    Using the Link-line Advisor
    Using the Command-line Link Tool
  Linking Examples
    Linking on IA-32 Architecture Systems
    Linking on Intel(R) 64 Architecture Systems
  Linking in Detail
    Listing Libraries on a Link Line
    Dynamically Selecting the Interface and Threading Layer
    Linking with Interface Libraries
      Using the ILP64 Interface vs. LP64 Interface
      Linking with Fortran 95 Interface Libraries
    Linking with Threading Libraries
      Sequential Mode of the Library
      Selecting the Threading Layer
    Linking with Compiler Run-time Libraries
    Linking with System Libraries
  Building Custom Dynamically Linked Shared Libraries
    Using the Custom Dynamically Linked Shared Library Builder
    Composing a List of Functions
    Specifying Function Names
    Distributing Your Custom Dynamically Linked Shared Library

Chapter 5: Managing Performance and Memory
  Using Parallelism of the Intel® Math Kernel Library
    Threaded Functions and Problems
    Avoiding Conflicts in the Execution Environment
    Techniques to Set the Number of Threads
    Setting the Number of Threads Using an OpenMP* Environment Variable
    Changing the Number of Threads at Run Time
    Using Additional Threading Control
      Intel MKL-specific Environment Variables for Threading Control
      MKL_DYNAMIC
      MKL_DOMAIN_NUM_THREADS
      Setting the Environment Variables for Threading Control
  Tips and Techniques to Improve Performance
    Coding Techniques
    Hardware Configuration Tips
    Operating on Denormals
    FFT Optimized Radices
  Using Memory Management
    Intel MKL Memory Management Software
    Redefining Memory Functions

Chapter 6: Language-specific Usage Options
  Using Language-Specific Interfaces with Intel® Math Kernel Library
    Interface Libraries and Modules
    Fortran 95 Interfaces to LAPACK and BLAS
    Compiler-dependent Functions and Fortran 90 Modules
  Mixed-language Programming with the Intel Math Kernel Library
    Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments
    Using Complex Types in C/C++
    Calling BLAS Functions that Return the Complex Values in C/C++ Code
    Support for Boost uBLAS Matrix-matrix Multiplication
    Invoking Intel MKL Functions from Java* Applications
      Intel MKL Java* Examples
      Running the Java* Examples
      Known Limitations of the Java* Examples

Chapter 7: Coding Tips
  Aligning Data for Consistent Results
  Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Chapter 8: Configuring Your Integrated Development Environment to Link with Intel Math Kernel Library
  Configuring the Apple Xcode* Developer Software to Link with Intel® Math Kernel Library

Chapter 9: Intel® Optimized LINPACK Benchmark for Mac OS* X
  Contents of the Intel® Optimized LINPACK Benchmark
  Running the Software
  Known Limitations of the Intel® Optimized LINPACK Benchmark

Appendix A: Intel® Math Kernel Library Language Interfaces Support
  Language Interfaces Support, by Function Domain
  Include Files

Appendix B: Support for Third-Party Interfaces
  GMP* Functions
  FFTW Interface Support

Appendix C: Directory Structure in Detail
  Static Libraries in the lib directory
  Dynamic Libraries in the lib directory

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright © 2007 - 2011, Intel Corporation. All rights reserved.
Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Introducing the Intel® Math Kernel Library

The Intel® Math Kernel Library (Intel® MKL) improves performance of scientific, engineering, and financial software that solves large computational problems. Among other functionality, Intel MKL provides linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores (see the Intel® MKL Release Notes for the full list of supported processors). Intel MKL also performs well on non-Intel processors.

Intel MKL is thread-safe and extensively threaded using the OpenMP* technology.

Intel MKL provides the following major functionality:

• Linear algebra, implemented in LAPACK (solvers and eigensolvers) plus level 1, 2, and 3 BLAS, offering the vector, vector-matrix, and matrix-matrix operations needed for complex mathematical software. If you prefer the FORTRAN 90/95 programming language, you can call LAPACK driver and computational subroutines through specially designed interfaces with reduced numbers of arguments.
  A C interface to LAPACK is also available.
• ScaLAPACK (SCAlable LAPACK) with its support functionality including the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS). ScaLAPACK is available for Intel MKL for Linux* and Windows* operating systems.
• Direct sparse solver, an iterative sparse solver, and a supporting set of sparse BLAS (level 1, 2, and 3) for solving sparse systems of equations.
• Multidimensional discrete Fourier transforms (1D, 2D, 3D) with a mixed radix support (for sizes not limited to powers of 2). Distributed versions of these functions are provided for use on clusters on the Linux* and Windows* operating systems.
• A set of vectorized transcendental functions called the Vector Math Library (VML). For most of the supported processors, the Intel MKL VML functions offer greater performance than the libm (scalar) functions, while keeping the same high accuracy.
• The Vector Statistical Library (VSL), which offers high performance vectorized random number generators for several probability distributions, convolution and correlation routines, and summary statistics functions.
• Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.

For details see the Intel® MKL Reference Manual.

Getting Help and Support

Intel provides a support web site that contains a rich repository of self help information, including getting started tips, known product issues, product errata, license information, user forums, and more. Visit the Intel MKL support website at http://www.intel.com/software/products/support/.

Notational Conventions

The following term is used in reference to the operating system.

Mac OS* X
    This term refers to information that is valid on all Intel®-based systems running the Mac OS* X operating system.

The following notations are used to refer to Intel MKL directories.

    The installation directory for the Intel® C++ Composer XE or Intel® Fortran Composer XE.
    The main directory where Intel MKL is installed: =/mkl. Replace this placeholder with the specific pathname in the configuring, linking, and building instructions.

The following font conventions are used in this document.

Italic
    Italic is used for emphasis and also indicates document names in body text, for example: see Intel MKL Reference Manual.
Monospace lowercase mixed with uppercase
    Indicates:
    • Commands and command-line options, for example, icc myprog.c -L$MKLPATH -I$MKLINCLUDE -lmkl -liomp5 -lpthread
    • Filenames, directory names, and pathnames, for example, /System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home
    • C/C++ code fragments, for example, a = new double [SIZE*SIZE];
UPPERCASE MONOSPACE
    Indicates system variables, for example, $MKLPATH.
Monospace italic
    Indicates a parameter in discussions, for example, lda. When enclosed in angle brackets, indicates a placeholder for an identifier, an expression, a string, a symbol, or a value, for example, .
    Substitute one of these items for the placeholder.
[ items ]
    Square brackets indicate that the items enclosed in brackets are optional.
{ item | item }
    Braces indicate that only one of the items listed between braces should be selected. A vertical bar ( | ) separates the items.

Chapter 1: Overview

Document Overview

The Intel® Math Kernel Library (Intel® MKL) User's Guide provides usage information for the library. The usage information covers the organization, configuration, performance, and accuracy of Intel MKL, specifics of routine calls in mixed-language programming, linking, and more.

This guide describes OS-specific usage of Intel MKL, along with OS-independent features. The document contains usage information for all Intel MKL function domains.

This User's Guide provides the following information:

• Describes post-installation steps to help you start using the library
• Shows you how to configure the library with your development environment
• Acquaints you with the library structure
• Explains how to link your application with the library and provides simple usage scenarios
• Describes how to code, compile, and run your application with Intel MKL

This guide is intended for Mac OS X programmers with beginner to advanced experience in software development.

See Also
Language Interfaces Support, by Function Domain

What's New

This User's Guide documents the Intel® Math Kernel Library (Intel® MKL) 10.3 Update 8. The document was updated to reflect addition of Data Fitting Functions to the product.

Related Information

To learn how to use the library in your application, use this guide in conjunction with the following documents:

• The Intel® Math Kernel Library Reference Manual, which provides reference information on routine functionalities, parameter descriptions, interfaces, calling syntaxes, and return values.
• The Intel® Math Kernel Library for Mac OS* X Release Notes.
Chapter 2: Getting Started

Checking Your Installation

After installing the Intel® Math Kernel Library (Intel® MKL), verify that the library is properly installed and configured:

1. Intel MKL installs in . Check that the subdirectory of referred to as was created.
2. If you want to keep multiple versions of Intel MKL installed on your system, update your build scripts to point to the correct Intel MKL version.
3. Check that the following files appear in the /bin directory and its subdirectories:
   mklvars.sh
   mklvars.csh
   ia32/mklvars_ia32.sh
   ia32/mklvars_ia32.csh
   intel64/mklvars_intel64.sh
   intel64/mklvars_intel64.csh
   Use these files to assign Intel MKL-specific values to several environment variables, as explained in Setting Environment Variables.
4. To understand how the Intel MKL directories are structured, see Intel® Math Kernel Library Structure.
5. To make sure that Intel MKL runs on your system, launch an Intel MKL example, as explained in Using Code Examples.
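Step 3 of the checklist above can be automated. The following is a minimal sketch, not part of Intel MKL: check_mkl_scripts is a hypothetical helper name, and the bin directory is passed as an argument because its actual location depends on your installation.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Intel MKL): report which of the
# environment scripts listed in step 3 are missing from a given bin directory.
# Returns 0 only if all six files are present.
check_mkl_scripts() {
    bindir=$1
    missing=0
    for f in mklvars.sh mklvars.csh \
             ia32/mklvars_ia32.sh ia32/mklvars_ia32.csh \
             intel64/mklvars_intel64.sh intel64/mklvars_intel64.csh
    do
        if [ ! -f "$bindir/$f" ]; then
            echo "missing: $bindir/$f"
            missing=$((missing + 1))
        fi
    done
    echo "$missing file(s) missing"
    [ "$missing" -eq 0 ]
}
```

A call such as check_mkl_scripts /path/to/mkl/bin (the path is a placeholder) prints one line per missing script followed by a summary, so the result can be used directly in a build script condition.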
See Also
Notational Conventions

Setting Environment Variables

When the installation of Intel MKL for Mac OS* X is complete, set the INCLUDE, MKLROOT, DYLD_LIBRARY_PATH, MANPATH, LIBRARY_PATH, CPATH, FPATH, and NLSPATH environment variables in the command shell using one of the script files in the bin subdirectory of the Intel MKL installation directory.

Choose the script corresponding to your system architecture and command shell as explained in the following table:

Architecture           Shell    Script File
IA-32                  C        ia32/mklvars_ia32.csh
IA-32                  Bash     ia32/mklvars_ia32.sh
Intel® 64              C        intel64/mklvars_intel64.csh
Intel® 64              Bash     intel64/mklvars_intel64.sh
IA-32 and Intel® 64    C        mklvars.csh
IA-32 and Intel® 64    Bash     mklvars.sh

Running the Scripts

The scripts accept parameters to specify the following:

• Architecture.
• Addition of a path to Fortran 95 modules precompiled with the Intel® Fortran compiler to the FPATH environment variable. Supply this parameter only if you are using the Intel® Fortran compiler.
• Interface of the Fortran 95 modules. This parameter is needed only if you requested addition of a path to the modules.

Usage and values of these parameters depend on the script name (regardless of the extension). The following table lists values of the script parameters.

Script             Architecture                  Addition of a Path to             Interface
                   (required, when applicable)   Fortran 95 Modules (optional)     (optional)
mklvars_ia32       n/a†                          mod                               n/a
mklvars_intel64    n/a                           mod                               lp64 (default), ilp64
mklvars            ia32, intel64                 mod                               lp64 (default), ilp64

† Not applicable.

For example:

• The command mklvars_ia32.sh sets environment variables for the IA-32 architecture and adds no path to the Fortran 95 modules.
• The command mklvars_intel64.sh mod ilp64 sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the ILP64 interface to the FPATH environment variable.
• The command mklvars.sh intel64 mod sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the LP64 interface to the FPATH environment variable.

NOTE Supply the parameter specifying the architecture first, if it is needed. Values of the other two parameters can be listed in any order.

See Also
High-level Directory Structure
Interface Libraries and Modules
Fortran 95 Interfaces to LAPACK and BLAS
Setting the Number of Threads Using an OpenMP* Environment Variable

Compiler Support

Intel MKL supports compilers identified in the Release Notes. However, the library has been successfully used with other compilers as well.

Intel MKL provides a set of include files to simplify program development by specifying enumerated values and prototypes for the respective functions. Calling Intel MKL functions from your application without an appropriate include file may lead to incorrect behavior of the functions.

See Also
Include Files

Using Code Examples

The Intel MKL package includes code examples, located in the examples subdirectory of the installation directory. Use the examples to determine:

• Whether Intel MKL is working on your system
• How you should call the library
• How to link the library

The examples are grouped in subdirectories mainly by Intel MKL function domains and programming languages. For example, the examples/spblas subdirectory contains a makefile to build the Sparse BLAS examples and the examples/vmlc subdirectory contains the makefile to build the C VML examples. Source code for the examples is in the next-level sources subdirectory.
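Returning to the script table in Setting Environment Variables, the choice of an architecture-specific script can be sketched as a small shell function. This is an illustration only: mklvars_script is a hypothetical name, and the function merely maps an architecture and shell family to the relative script path listed in that table.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Intel MKL): print the path, relative
# to the Intel MKL bin directory, of the architecture-specific script for a
# given architecture (ia32 or intel64) and shell family (bash or csh).
mklvars_script() {
    arch=$1
    family=$2
    case $family in
        bash) ext=sh ;;
        csh)  ext=csh ;;
        *)    echo "unknown shell family: $family" >&2; return 1 ;;
    esac
    case $arch in
        ia32|intel64) echo "$arch/mklvars_$arch.$ext" ;;
        *)            echo "unknown architecture: $arch" >&2; return 1 ;;
    esac
}
```

For instance, mklvars_script intel64 bash prints intel64/mklvars_intel64.sh. The parameters described under Running the Scripts (mod, lp64, ilp64) would still be supplied when the selected script is run.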
See Also
High-level Directory Structure

What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Target platform
Identify the architecture of your target machine:
• IA-32 or compatible
• Intel® 64 or compatible
Reason: See Linking Examples. To configure your development environment for use with Intel MKL, set your environment variables using the script corresponding to your architecture (see Setting Environment Variables for details).

Mathematical problem
Identify all Intel MKL function domains that you require:
• BLAS
• Sparse BLAS
• LAPACK
• Sparse Solver routines
• Vector Mathematical Library functions (VML)
• Vector Statistical Library functions
• Fourier Transform functions (FFT)
• Trigonometric Transform routines
• Poisson, Laplace, and Helmholtz Solver routines
• Optimization (Trust-Region) Solver routines
• Data Fitting Functions
• GMP* arithmetic functions. Deprecated and will be removed in a future release.
Reason: The function domain you intend to use narrows the search in the Reference Manual for specific routines you need. Coding tips may also depend on the function domain (see Tips and Techniques to Improve Performance).

Programming language
Intel MKL provides support for both Fortran and C/C++ programming. Identify the language interfaces that your function domains support (see Intel® Math Kernel Library Language Interfaces Support).
Reason: Intel MKL provides language-specific include files for each function domain to simplify program development (see Language Interfaces Support, by Function Domain). For a list of language-specific interface libraries and modules and an example of how to generate them, see also Using Language-Specific Interfaces with Intel® Math Kernel Library.

Range of integer data
If your system is based on the Intel 64 architecture, identify whether your application performs calculations with large data arrays (of more than 2^31 - 1 elements).
Reason: To operate on large data arrays, you need to select the ILP64 interface, where integers are 64-bit; otherwise, use the default, LP64, interface, where integers are 32-bit (see Using the ILP64 Interface vs. LP64 Interface).

Threading model
Identify whether and how your application is threaded:
• Threaded with the Intel compiler
• Threaded with a third-party compiler
• Not threaded
Reason: The compiler you use to thread your application determines which threading library you should link with your application. For applications threaded with a third-party compiler you may need to use Intel MKL in the sequential mode (for more information, see Sequential Mode of the Library and Linking with Threading Libraries).

Number of threads
Determine the number of threads you want Intel MKL to use.
Reason: Intel MKL is based on the OpenMP* threading. By default, the OpenMP* software sets the number of threads that Intel MKL uses. If you need a different number, you have to set it yourself using one of the available mechanisms. For more information, see Using Parallelism of the Intel® Math Kernel Library.

Linking model
Decide which linking model is appropriate for linking your application with Intel MKL libraries:
• Static
• Dynamic
Reason: The link line syntax and libraries for static and dynamic linking are different. For the list of link libraries for static and dynamic models, linking examples, and other relevant topics, like how to save disk space by creating a custom dynamic library, see Linking Your Application with the Intel® Math Kernel Library.

Structure of the Intel® Math Kernel Library

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Architecture Support

Intel® Math Kernel Library (Intel® MKL) for Mac OS* X supports the IA-32, Intel® 64, and compatible architectures in its universal libraries, located in the /lib directory.

NOTE Universal libraries contain both 32-bit and 64-bit code. If these libraries are used for linking, the linker dispatches appropriate code as follows:
• A 32-bit linker dispatches 32-bit code and creates 32-bit executable files.
• A 64-bit linker dispatches 64-bit code and creates 64-bit executable files.

See Also
High-level Directory Structure
Directory Structure in Detail

High-level Directory Structure

Directory                Contents
                         Installation directory of the Intel® Math Kernel Library (Intel® MKL)
Subdirectories of
bin/                     Scripts to set environment variables in the user shell
bin/ia32                 Shell scripts for the IA-32 architecture
bin/intel64              Shell scripts for the Intel® 64 architecture
benchmarks/linpack       Shared-Memory (SMP) version of LINPACK benchmark
examples                 Examples directory.
                         Each subdirectory has source and data files
include                  INCLUDE files for the library routines, as well as for tests and examples
include/ia32             Fortran 95 .mod files for the IA-32 architecture and Intel® Fortran compiler
include/intel64/lp64     Fortran 95 .mod files for the Intel® 64 architecture, Intel® Fortran compiler, and LP64 interface
include/intel64/ilp64    Fortran 95 .mod files for the Intel® 64 architecture, Intel® Fortran compiler, and ILP64 interface
include/fftw             Header files for the FFTW2 and FFTW3 interfaces
interfaces/blas95        Fortran 95 interfaces to BLAS and a makefile to build the library
interfaces/fftw2xc       FFTW 2.x interfaces to Intel MKL FFTs (C interface)
interfaces/fftw2xf       FFTW 2.x interfaces to Intel MKL FFTs (Fortran interface)
interfaces/fftw3xc       FFTW 3.x interfaces to Intel MKL FFTs (C interface)
interfaces/fftw3xf       FFTW 3.x interfaces to Intel MKL FFTs (Fortran interface)
interfaces/lapack95      Fortran 95 interfaces to LAPACK and a makefile to build the library
lib                      Universal static libraries and shared objects for the IA-32 and Intel® 64 architectures
tests                    Source and data files for tests
tools                    Tools and plug-ins
tools/builder            Tools for creating custom dynamically linkable libraries
tools/plugins/com.intel.mkl.help
                         Eclipse* IDE plug-in with Intel MKL Reference Manual in WebHelp format. See mkl_documentation.htm for more information
Subdirectories of
Documentation/en_US/mkl  Intel MKL documentation
man/en_US/man3           Man pages for Intel MKL functions

See Also
Notational Conventions

Layered Model Concept

Intel MKL is structured to support multiple compilers and interfaces, different OpenMP* implementations, both serial and multiple threads, and a wide range of processors. Conceptually Intel MKL can be divided into distinct parts to support different interfaces, threading models, and core computations:
1. Interface Layer
2. Threading Layer
3.
Computational Layer

You can combine Intel MKL libraries to meet your needs by linking with one library in each part, layer by layer. Once the interface library is selected, the threading library you select picks up the chosen interface, and the computational library uses the interfaces and OpenMP implementation (or non-threaded mode) chosen in the first two layers.

To support threading with different compilers, one more layer is needed, which contains libraries not included in Intel MKL:
• Compiler run-time libraries (RTL).

The following table provides more details of each layer.

Layer: Description

Interface Layer
This layer matches compiled code of your application with the threading and/or computational parts of the library. This layer provides:
• LP64 and ILP64 interfaces.
• Compatibility with compilers that return function values differently.
• A mapping between single-precision names and double-precision names for applications using Cray*-style naming (SP2DP interface). SP2DP interface supports Cray-style naming in applications targeted for the Intel 64 architecture and using the ILP64 interface. SP2DP interface provides a mapping between single-precision names (for both real and complex types) in the application and double-precision names in Intel MKL BLAS and LAPACK. Function names are mapped as shown in the following example for BLAS functions ?GEMM:
  SGEMM -> DGEMM
  DGEMM -> DGEMM
  CGEMM -> ZGEMM
  ZGEMM -> ZGEMM
  Note that no changes are made to double-precision names.

Threading Layer
This layer:
• Provides a way to link threaded Intel MKL with different threading compilers.
• Enables you to link with a threaded or sequential mode of the library.
This layer is compiled for different environments (threaded or sequential) and compilers (from Intel, GNU*).

Computational Layer
This layer is the heart of Intel MKL. It has only one library for each combination of architecture and supported OS.
The Computational layer accommodates multiple architectures through identification of architecture features and chooses the appropriate binary code at run time.

Compiler Run-time Libraries (RTL)
To support threading with Intel compilers, Intel MKL uses RTLs of the Intel® C++ Composer XE or Intel® Fortran Composer XE. To thread using third-party threading compilers, use libraries in the Threading layer or an appropriate compatibility library.

See Also
Using the ILP64 Interface vs. LP64 Interface
Linking Your Application with the Intel® Math Kernel Library
Linking with Threading Libraries

Accessing the Intel® Math Kernel Library Documentation

Contents of the Documentation Directories

Most of Intel MKL documentation is installed at /Documentation//mkl. For example, the documentation in English is installed at /Documentation/en_US/mkl. However, some Intel MKL-related documents are installed one or two levels up. The following table lists MKL-related documentation.

File name                    Comment
Files in /Documentation
/clicense or /flicense       Common end user license for the Intel® C++ Composer XE 2011 or Intel® Fortran Composer XE 2011, respectively
mklsupport.txt               Information on package number for customer support reference
Contents of /Documentation//mkl
redist.txt                   List of redistributable files
mkl_documentation.htm        Overview and links for the Intel MKL documentation
mkl_manual/index.htm         Intel MKL Reference Manual in an uncompressed HTML format
Release_Notes.htm            Intel MKL Release Notes
mkl_userguide/index.htm      Intel MKL User's Guide in an uncompressed HTML format, this document
mkl_link_line_advisor.htm    Intel MKL Link-line Advisor

Viewing Man Pages

To access Intel MKL man pages, add the man pages directory to the MANPATH environment variable. If you performed the Setting Environment Variables step of the Getting Started process, this is done automatically.
To view the man page for an Intel MKL function, enter the man command followed by the function name in your command shell. In this release, the name to give is the function name with omitted prefixes denoting data type, task type, or any other field that may vary for this function. Examples:
• For the BLAS function ddot, enter man dot
• For the statistical function vslConvSetMode, enter man vslSetMode
• For the VML function vdPackM, enter man vPack
• For the FFT function DftiCommitDescriptor, enter man DftiCommitDescriptor

NOTE Function names in the man command are case-sensitive.

See Also
High-level Directory Structure
Setting Environment Variables

Linking Your Application with the Intel® Math Kernel Library

Linking Quick Start

Intel® Math Kernel Library (Intel® MKL) provides several options for quick linking of your application, which depend on the way you link:
• Using the Intel® Composer XE compiler: see Using the -mkl Compiler Option.
• Explicit dynamic linking: see Using the Single Dynamic Library for how to simplify your link line.
• Explicitly listing libraries on your link line: see Selecting Libraries to Link with for a summary of the libraries.
• Using an interactive interface: see Using the Link-line Advisor to determine libraries and options to specify on your link or compilation line.
• Using an internally provided tool: see Using the Command-line Link Tool to determine libraries, options, and environment variables or even compile and build your application.

Using the -mkl Compiler Option

The Intel® Composer XE compiler supports the following variants of the -mkl compiler option:
• -mkl or -mkl=parallel to link with standard threaded Intel MKL.
• -mkl=sequential to link with sequential version of Intel MKL.
• -mkl=cluster to link with Intel MKL cluster components (sequential) that use Intel MPI.

For more information on the -mkl compiler option, see the Intel Compiler User and Reference Guides.

On Intel® 64 architecture systems, for each variant of the -mkl option, the compiler links your application using the LP64 interface.

If you specify any variant of the -mkl compiler option, the compiler automatically includes the Intel MKL libraries. In cases not covered by the option, use the Link-line Advisor or see Linking in Detail.

See Also
Listing Libraries on a Link Line
Using the ILP64 Interface vs. LP64 Interface
Using the Link-line Advisor
Intel® Software Documentation Library

Using the Single Dynamic Library

You can simplify your link line through the use of the Intel MKL Single Dynamic Library (SDL). To use SDL, place libmkl_rt.dylib on your link line. For example:

  icc application.c -lmkl_rt

SDL enables you to select the interface and threading library for Intel MKL at run time. By default, linking with SDL provides:
• LP64 interface on systems based on the Intel® 64 architecture
• Intel threading

To use other interfaces or change threading preferences, including use of the sequential version of Intel MKL, you need to specify your choices using functions or environment variables as explained in section Dynamically Selecting the Interface and Threading Layer.
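As an illustration of the environment-variable route for SDL, here is a minimal Python sketch that prepares an environment for launching an SDL-linked program. The variable names MKL_INTERFACE_LAYER and MKL_THREADING_LAYER and their values come from the section Dynamically Selecting the Interface and Threading Layer; the helper name sdl_environment is hypothetical and not part of Intel MKL.

```python
import os

# Values listed in "Dynamically Selecting the Interface and Threading Layer".
INTERFACE_LAYERS = {"LP64", "ILP64"}
THREADING_LAYERS = {"INTEL", "SEQUENTIAL"}


def sdl_environment(interface="LP64", threading="INTEL", base_env=None):
    """Return a copy of the environment with the SDL layer variables set.

    Hypothetical helper for illustration only; it validates the two values
    and does not call into Intel MKL.
    """
    if interface not in INTERFACE_LAYERS:
        raise ValueError(f"unknown interface layer: {interface!r}")
    if threading not in THREADING_LAYERS:
        raise ValueError(f"unknown threading layer: {threading!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_INTERFACE_LAYER"] = interface
    env["MKL_THREADING_LAYER"] = threading
    return env


# Start from an empty base environment so the result is easy to inspect.
env = sdl_environment(interface="ILP64", threading="SEQUENTIAL", base_env={})
print(env)
```

The resulting dictionary could be passed as the env argument of subprocess.run when starting the application. The interface setting is meaningful only on Intel® 64 systems, and both variables are ignored if the application itself calls mkl_set_interface_layer or mkl_set_threading_layer.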
Selecting Libraries to Link with

To link with Intel MKL:
• Choose one library from the Interface layer and one library from the Threading layer
• Add the only library from the Computational layer and run-time libraries (RTL)

The following table lists Intel MKL libraries to link with your application.

                                         Interface layer          Threading layer            Computational layer  RTL
IA-32 architecture, static linking       libmkl_intel.a           libmkl_intel_thread.a      libmkl_core.a        libiomp5.dylib
IA-32 architecture, dynamic linking      libmkl_intel.dylib       libmkl_intel_thread.dylib  libmkl_core.dylib    libiomp5.dylib
Intel® 64 architecture, static linking   libmkl_intel_lp64.a      libmkl_intel_thread.a      libmkl_core.a        libiomp5.dylib
Intel® 64 architecture, dynamic linking  libmkl_intel_lp64.dylib  libmkl_intel_thread.dylib  libmkl_core.dylib    libiomp5.dylib

The Single Dynamic Library (SDL) automatically links interface, threading, and computational libraries and thus simplifies linking. The following table lists Intel MKL libraries for dynamic linking using SDL. See Dynamically Selecting the Interface and Threading Layer for how to set the interface and threading layers at run time through function calls or environment settings.

                                         SDL                      RTL
IA-32 and Intel® 64 architectures        libmkl_rt.dylib          libiomp5.dylib †

† Use the Link-line Advisor to check whether you need to explicitly link the libiomp5.dylib RTL.

For exceptions and alternatives to the libraries listed above, see Linking in Detail.

See Also
Layered Model Concept
Using the Link-line Advisor
Using the -mkl Compiler Option

Using the Link-line Advisor

Use the Intel MKL Link-line Advisor to determine the libraries and options to specify on your link or compilation line.

The latest version of the tool is available at http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor. The tool is also available in the product.
The Advisor requests information about your system and on how you intend to use Intel MKL (link dynamically or statically, use threaded or sequential mode, etc.). The tool automatically generates the appropriate link line for your application.

See Also
Contents of the Documentation Directories

Using the Command-line Link Tool

Use the command-line Link tool provided by Intel MKL to simplify building your application with Intel MKL.

The tool not only provides the options, libraries, and environment variables to use, but also performs compilation and building of your application.

The tool mkl_link_tool is installed in the /tools directory.

See the knowledge base article at http://software.intel.com/en-us/articles/mkl-command-line-link-tool for more information.

Linking Examples

See Also
Using the Link-line Advisor

Linking on IA-32 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file. C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icc.

NOTE If you successfully completed the Setting Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.
In these examples, MKLPATH=$MKLROOT/lib, MKLINCLUDE=$MKLROOT/include:

• Static linking of myprog.f and parallel Intel MKL:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Static linking of myprog.f and sequential version of Intel MKL:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -lpthread
• Dynamic linking of myprog.f and sequential version of Intel MKL:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel -lmkl_sequential -lmkl_core -lpthread
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL (call the mkl_set_threading_layer function or set the value of the MKL_THREADING_LAYER environment variable to choose threaded or sequential mode):
  ifort myprog.f -lmkl_rt
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_lapack95 $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_blas95 $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Linking on Intel(R) 64 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file. C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icc.

NOTE If you successfully completed the Setting Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.

In these examples, MKLPATH=$MKLROOT/lib, MKLINCLUDE=$MKLROOT/include:

• Static linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Static linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -lpthread
• Dynamic linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread
• Static linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Dynamic linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL (call appropriate functions or set environment variables to choose threaded or sequential mode and to set the interface):
  ifort myprog.f -lmkl_rt
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL supporting the LP64 interface:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/intel64/lp64 -lmkl_lapack95_lp64 $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL supporting the LP64 interface:
  ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/intel64/lp64 -lmkl_blas95_lp64 $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -liomp5 -lpthread

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Linking in Detail

This section recommends which libraries to link with depending on your Intel MKL usage scenario and provides details of the linking.

Listing Libraries on a Link Line

To link with Intel MKL, specify paths and libraries on the link line as shown below.

NOTE The syntax below is for dynamic linking. For static linking, replace each library name preceded with "-l" with the path to the library file. For example, replace -lmkl_core with $MKLPATH/libmkl_core.a, where $MKLPATH is the appropriate user-defined environment variable.
  -L -I
  [-I/{ia32|intel64|{ilp64|lp64}}]
  [-lmkl_blas{95|95_ilp64|95_lp64}]
  [-lmkl_lapack{95|95_ilp64|95_lp64}]
  -lmkl_{intel|intel_ilp64|intel_lp64}
  -lmkl_{intel_thread|sequential}
  -lmkl_core
  -liomp5 [-lpthread] [-lm]

In case of static linking, for all components except BLAS and FFT, repeat interface, threading, and computational libraries two times (for example, libmkl_intel_ilp64.a libmkl_intel_thread.a libmkl_core.a libmkl_intel_ilp64.a libmkl_intel_thread.a libmkl_core.a). For the LAPACK component, repeat the threading and computational libraries three times.

The order of listing libraries on the link line is essential.

See Also
Using the Link-line Advisor
Linking Examples

Dynamically Selecting the Interface and Threading Layer

The Single Dynamic Library (SDL) enables you to dynamically select the interface and threading layer for Intel MKL.

Setting the Interface Layer

Available interfaces depend on the architecture of your system.

On systems based on the Intel® 64 architecture, LP64 and ILP64 interfaces are available. To set one of these interfaces at run time, use the mkl_set_interface_layer function or the MKL_INTERFACE_LAYER environment variable. The following table provides values to be used to set each interface.

Interface Layer  Value of MKL_INTERFACE_LAYER  Value of the Parameter of mkl_set_interface_layer
LP64             LP64                          MKL_INTERFACE_LP64
ILP64            ILP64                         MKL_INTERFACE_ILP64

If the mkl_set_interface_layer function is called, the environment variable MKL_INTERFACE_LAYER is ignored.

By default the LP64 interface is used.

See the Intel MKL Reference Manual for details of the mkl_set_interface_layer function.

Setting the Threading Layer

To set the threading layer at run time, use the mkl_set_threading_layer function or the MKL_THREADING_LAYER environment variable. The following table lists available threading layers along with the values to be used to set each layer.
Threading Layer               Value of MKL_THREADING_LAYER  Value of the Parameter of mkl_set_threading_layer
Intel threading               INTEL                         MKL_THREADING_INTEL
Sequential mode of Intel MKL  SEQUENTIAL                    MKL_THREADING_SEQUENTIAL

If the mkl_set_threading_layer function is called, the environment variable MKL_THREADING_LAYER is ignored.

By default Intel threading is used.

See the Intel MKL Reference Manual for details of the mkl_set_threading_layer function.

See Also
Using the Single Dynamic Library
Layered Model Concept
Directory Structure in Detail

Linking with Interface Libraries

Using the ILP64 Interface vs. LP64 Interface

The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 2^31 - 1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.

The LP64 and ILP64 interfaces are implemented in the Interface layer. Link with the following interface libraries for the LP64 or ILP64 interface, respectively:
• libmkl_intel_lp64.a or libmkl_intel_ilp64.a for static linking
• libmkl_intel_lp64.dylib or libmkl_intel_ilp64.dylib for dynamic linking

The ILP64 interface provides for the following:
• Support large data arrays (with more than 2^31 - 1 elements)
• Enable compiling your Fortran code with the -i8 compiler option

The LP64 interface provides compatibility with the previous Intel MKL versions because "LP64" is just a new name for the only interface that the Intel MKL versions lower than 9.1 provided. Choose the ILP64 interface if your application uses Intel MKL for calculations with large data arrays or may do so in the future.

Intel MKL provides the same include directory for the ILP64 and LP64 interfaces.

Compiling for LP64/ILP64

The table below shows how to compile for the ILP64 and LP64 interfaces:

Fortran
  Compiling for ILP64    ifort -i8 -I/include ...
  Compiling for LP64     ifort -I/include ...
C or C++
  Compiling for ILP64    icc -DMKL_ILP64 -I/include ...
  Compiling for LP64     icc -I/include ...

CAUTION Linking of an application compiled with the -i8 or -DMKL_ILP64 option to the LP64 libraries may result in unpredictable consequences and erroneous output.

Coding for ILP64

You do not need to change existing code if you are not using the ILP64 interface.

To migrate to ILP64 or write new code for ILP64, use appropriate types for parameters of the Intel MKL functions and subroutines:

Integer Types                                                            Fortran                          C or C++
32-bit integers                                                          INTEGER*4 or INTEGER(KIND=4)     int
Universal integers for ILP64/LP64: 64-bit for ILP64, 32-bit otherwise    INTEGER without specifying KIND  MKL_INT
Universal integers for ILP64/LP64: 64-bit integers                       INTEGER*8 or INTEGER(KIND=8)     MKL_INT64
FFT interface integers for ILP64/LP64                                    INTEGER without specifying KIND  MKL_LONG

To determine the type of an integer parameter of a function, use appropriate include files. For functions that support only a Fortran interface, use the C/C++ include files *.h.

The above table explains which integer parameters of functions become 64-bit and which remain 32-bit for ILP64. The table applies to most Intel MKL functions except some VML and VSL functions, which require integer parameters to be 64-bit or 32-bit regardless of the interface:
• VML: The mode parameter of VML functions is 64-bit.
• Random Number Generators (RNG): All discrete RNG except viRngUniformBits64 are 32-bit. The viRngUniformBits64 generator function and vslSkipAheadStream service function are 64-bit.
• Summary Statistics: The estimate parameter of the vslsSSCompute/vsldSSCompute function is 64-bit.

Refer to the Intel MKL Reference Manual for more information.

To better understand ILP64 interface details, see also examples and tests.
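The choice between LP64 and ILP64 described above comes down to whether any array has more than 2^31 - 1 elements. The following Python sketch is only an illustration of that arithmetic and of the compiler options from the Compiling for LP64/ILP64 table; the function names choose_interface and compile_flags are hypothetical, and the include path is left out because it depends on your installation.

```python
INT32_MAX = 2**31 - 1  # largest element count a 32-bit (LP64) integer can index


def choose_interface(element_count):
    """Return 'ilp64' if an array of this size needs 64-bit indexing, else 'lp64'."""
    if element_count < 0:
        raise ValueError("element count must be non-negative")
    return "ilp64" if element_count > INT32_MAX else "lp64"


def compile_flags(interface, language):
    """Return the interface-specific compiler option from the table above.

    language is 'fortran' (ifort) or 'c' (icc); LP64 needs no extra option.
    """
    if interface == "lp64":
        return []
    if interface == "ilp64":
        return ["-i8"] if language == "fortran" else ["-DMKL_ILP64"]
    raise ValueError(f"unknown interface: {interface!r}")


# A 50000 x 50000 matrix has 2.5e9 elements, more than 2^31 - 1.
interface = choose_interface(50000 * 50000)
print(interface, compile_flags(interface, "fortran"), compile_flags(interface, "c"))
```

Whatever option is chosen at compile time must match the interface library on the link line, as the CAUTION above points out.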
Limitations

All Intel MKL function domains support ILP64 programming with the following exceptions:
• FFTW interfaces to Intel MKL:
  • FFTW 2.x wrappers do not support ILP64.
  • FFTW 3.2 wrappers support ILP64 by a dedicated set of functions plan_guru64.
• GMP* Arithmetic Functions do not support ILP64.

NOTE GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
High-level Directory Structure
Include Files
Language Interfaces Support, by Function Domain
Layered Model Concept
Directory Structure in Detail

Linking with Fortran 95 Interface Libraries

The libmkl_blas95*.a and libmkl_lapack95*.a libraries contain Fortran 95 interfaces for BLAS and LAPACK, respectively, which are compiler-dependent. In the Intel MKL package, they are prebuilt for the Intel® Fortran compiler. If you are using a different compiler, build these libraries before using the interface.

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Compiler-dependent Functions and Fortran 90 Modules

Linking with Threading Libraries

Sequential Mode of the Library

You can use Intel MKL in a sequential (non-threaded) mode. In this mode, Intel MKL runs unthreaded code. However, it is thread-safe (except the LAPACK deprecated routine ?lacon), which means that you can use it in a parallel region in your OpenMP* code. The sequential mode requires no compatibility OpenMP* run-time library and does not respond to the environment variable OMP_NUM_THREADS or its Intel MKL equivalents.

You should use the library in the sequential mode only if you have a particular reason not to use Intel MKL threading. The sequential mode may be helpful when using Intel MKL with programs threaded with some non-Intel compilers or in other situations where you need a non-threaded version of the library (for instance, in some MPI cases). To set the sequential mode, in the Threading layer, choose the *sequential.* library.
Add the POSIX threads library (pthread) to your link line for the sequential mode because the *sequential.* library depends on pthread.

See Also
Directory Structure in Detail
Using Parallelism of the Intel® Math Kernel Library
Avoiding Conflicts in the Execution Environment
Linking Examples

Selecting the Threading Layer

Several compilers that Intel MKL supports use the OpenMP* threading technology. Intel MKL supports implementations of the OpenMP* technology that these compilers provide. To make use of this support, you need to link with the appropriate library in the Threading Layer and Compiler Support Run-time Library (RTL).

Threading Layer
Each Intel MKL threading library contains the same code compiled by the respective compiler (Intel, gnu and PGI* compilers on Mac OS X).

RTL
This layer includes libiomp, the compatibility OpenMP* run-time library of the Intel compiler. In addition to the Intel compiler, libiomp provides support for one more threading compiler on Mac OS X (GNU). That is, a program threaded with a GNU compiler can safely be linked with Intel MKL and libiomp.

The table below helps explain what threading library and RTL you should choose under different scenarios when using Intel MKL (static cases only):

Compiler  Application Threaded?  Threading Layer                             RTL Recommended  Comment
Intel     Does not matter        libmkl_intel_thread.a                       libiomp5.dylib
PGI       Yes                    libmkl_pgi_thread.a or libmkl_sequential.a  PGI* supplied    Use of libmkl_sequential.a removes threading from Intel MKL calls.
PGI       No                     libmkl_intel_thread.a                       libiomp5.dylib
PGI       No                     libmkl_pgi_thread.a                         PGI* supplied
PGI       No                     libmkl_sequential.a                         None
gnu       Yes                    libmkl_sequential.a                         None
gnu       No                     libmkl_intel_thread.a                       libiomp5.dylib
other     Yes                    libmkl_sequential.a                         None
other     No                     libmkl_intel_thread.a                       libiomp5.dylib

Linking with Compiler Run-time Libraries

Dynamically link libiomp, the compatibility OpenMP* run-time library, even if you link other libraries statically.

Linking to libiomp statically can be problematic because the more complex your operating environment or application, the more likely redundant copies of the library are included. This may result in performance issues (oversubscription of threads) and even incorrect results.

To link libiomp dynamically, be sure the DYLD_LIBRARY_PATH environment variable is defined correctly.

See Also
Setting Environment Variables
Layered Model Concept

Linking with System Libraries

To use the Intel MKL FFT, Trigonometric Transform, or Poisson, Laplace, and Helmholtz Solver routines, link in the math support system library by adding "-lm" to the link line.

On Mac OS X, the libiomp library relies on the native pthread library for multi-threading. Any time libiomp is required, add -lpthread to your link line afterwards (the order of listing libraries is important).

Building Custom Dynamically Linked Shared Libraries

Custom dynamically linked shared libraries reduce the collection of functions available in Intel MKL libraries to those required to solve your particular problems, which helps to save disk space and build your own dynamic libraries for distribution.

The Intel MKL custom dynamically linked shared library builder, located in the tools/builder directory, enables you to create a dynamically linked shared library containing the selected functions. The builder contains a makefile and a definition file with the list of functions.
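The rules in Linking with System Libraries and Sequential Mode of the Library can be summarized as a small ordering exercise for the tail of a link line. The Python sketch below is only an illustration under stated assumptions: the function name trailing_libs and the short domain labels ("fft", "trig_transform", "pde_solver") are hypothetical names invented here, standing for the FFT, Trigonometric Transform, and Poisson, Laplace, and Helmholtz Solver routines.

```python
# Hypothetical labels for the domains that need the math support library (-lm).
NEEDS_LIBM = {"fft", "trig_transform", "pde_solver"}


def trailing_libs(threading, domains=()):
    """Return the run-time and system libraries to place after the Intel MKL libraries.

    threading is 'parallel' (Intel threading with libiomp5) or 'sequential'.
    """
    if threading == "parallel":
        # libiomp relies on pthread, so -lpthread follows -liomp5.
        libs = ["-liomp5", "-lpthread"]
    elif threading == "sequential":
        # No OpenMP run-time library, but the sequential library depends on pthread.
        libs = ["-lpthread"]
    else:
        raise ValueError(f"unknown threading mode: {threading!r}")
    if any(domain in NEEDS_LIBM for domain in domains):
        libs.append("-lm")
    return libs


print(" ".join(trailing_libs("parallel", ["fft"])))
print(" ".join(trailing_libs("sequential")))
```

The output order matches the tail of the syntax in Listing Libraries on a Link Line (-liomp5 [-lpthread] [-lm]); the Link-line Advisor remains the authoritative way to obtain a complete link line.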
Using the Custom Dynamically Linked Shared Library Builder

To build a custom dynamically linked shared library, use the following command:

    make target [<options>]

The following table lists possible values of target and explains what the command does for each value:

Value     Comment
libuni    The builder uses static Intel MKL interface, threading, and core libraries to build a universal dynamically linked shared library for the IA-32 or Intel® 64 architecture.
dylibuni  The builder uses the single dynamic library libmkl_rt.dylib to build a universal dynamically linked shared library for the IA-32 or Intel® 64 architecture.
help      The command prints Help on the custom dynamically linked shared library builder.

The <options> placeholder stands for the list of parameters that define macros to be used by the makefile. The following table describes these parameters:

Parameter [Values]
    Description
interface = {lp64|ilp64}
    Defines whether to use the LP64 or ILP64 programming interface for the Intel 64 architecture. The default value is lp64.
threading = {parallel|sequential}
    Defines whether to use the Intel MKL in the threaded or sequential mode. The default value is parallel.
export = <file name>
    Specifies the full name of the file that contains the list of entry-point functions to be included in the shared object. The default name is user_example_list (no extension).
name = <library name>
    Specifies the name of the library to be created. By default, the name of the created library is mkl_custom.dylib.
xerbla = <error handler>
    Specifies the name of the object file (.o) that contains the user's error handler. The makefile adds this error handler to the library for use instead of the default Intel MKL error handler xerbla. If you omit this parameter, the native Intel MKL xerbla is used. See the description of the xerbla function in the Intel MKL Reference Manual on how to develop your own error handler.
MKLROOT = <mkl directory>
    Specifies the location of Intel MKL libraries used to build the custom dynamically linked shared library. By default, the builder uses the Intel MKL installation directory.

All the above parameters are optional.

In the simplest case, the command line is make ia32, and the missing options have default values. This command creates the mkl_custom.dylib library for processors using the IA-32 architecture. The command takes the list of functions from the user_example_list file and uses the native Intel MKL error handler xerbla.

An example of a more complex case follows:

    make ia32 export=my_func_list.txt name=mkl_small xerbla=my_xerbla.o

In this case, the command creates the mkl_small.dylib library for processors using the IA-32 architecture. The command takes the list of functions from the my_func_list.txt file and uses the user's error handler my_xerbla.o.

The process is similar for processors using the Intel® 64 architecture.

See Also
Using the Single Dynamic Library

Composing a List of Functions

To compose a list of functions for a minimal custom dynamically linked shared library needed for your application, you can use the following procedure:

1. Link your application with installed Intel MKL libraries to make sure the application builds.
2. Remove all Intel MKL libraries from the link line and start linking. Unresolved symbols indicate Intel MKL functions that your application uses.
3. Include these functions in the list.

Important
Each time your application starts using more Intel MKL functions, update the list to include the new functions.

See Also
Specifying Function Names

Specifying Function Names

In the file with the list of functions for your custom dynamically linked shared library, adjust function names to the required interface. For example, for Fortran functions append an underscore character "_" to the names as a suffix:

    dgemm_
    ddot_
    dgetrf_

For more examples, see domain-specific lists of functions in the /tools/builder folder.
NOTE
The lists of functions are provided in the /tools/builder folder merely as examples. See Composing a List of Functions for how to compose lists of functions for your custom dynamically linked shared library.

TIP
Names of Fortran-style routines (BLAS, LAPACK, etc.) can be either upper-case or lower-case, with or without the trailing underscore. For example, these names are equivalent:
BLAS: dgemm, DGEMM, dgemm_, DGEMM_
LAPACK: dgetrf, DGETRF, dgetrf_, DGETRF_

Properly capitalize names of C support functions in the function list. To do this, follow the guidelines below:

1. In the mkl_service.h include file, look up a #define directive for your function.
2. Take the function name from the replacement part of that directive.

For example, the #define directive for the mkl_disable_fast_mm function is

    #define mkl_disable_fast_mm MKL_Disable_Fast_MM

Capitalize the name of this function in the list like this: MKL_Disable_Fast_MM.

For the names of the Fortran support functions, see the tip.

NOTE
If selected functions have several processor-specific versions, the builder automatically includes them all in the custom library and the dispatcher manages them.

Distributing Your Custom Dynamically Linked Shared Library

To enable use of your custom dynamically linked shared library in a threaded mode, distribute libiomp5.dylib along with the custom dynamically linked shared library.

Managing Performance and Memory

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Using Parallelism of the Intel® Math Kernel Library

Intel MKL is extensively parallelized. See Threaded Functions and Problems for lists of threaded functions and problems that can be threaded.

Intel MKL is thread-safe, which means that all Intel MKL functions (except the LAPACK deprecated routine ?lacon) work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access for multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Therefore, you can call Intel MKL from multiple threads and not worry about the function instances interfering with each other.

The library uses OpenMP* threading software, so you can use the environment variable OMP_NUM_THREADS to specify the number of threads or the equivalent OpenMP run-time function calls. Intel MKL also offers variables that are independent of OpenMP, such as MKL_NUM_THREADS, and equivalent Intel MKL functions for thread management. The Intel MKL variables are always inspected first, then the OpenMP variables are examined, and if neither is used, the OpenMP software chooses the default number of threads.

By default, Intel MKL uses the number of threads equal to the number of physical cores on the system.

To achieve higher performance, set the number of threads to the number of real processors or physical cores, as summarized in Techniques to Set the Number of Threads.
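The lookup order described above (Intel MKL variables first, then OpenMP variables, then a default) can be sketched in C as follows. This is a toy model for illustration only, not Intel MKL code: the function names parse_thread_count and resolve_thread_count are hypothetical, the two strings stand in for the values of MKL_NUM_THREADS and OMP_NUM_THREADS, and default_threads stands in for whatever number the OpenMP software would choose.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse a positive thread count from an
   environment-style string. Returns 0 when the string is NULL,
   empty, or not a plain positive integer. */
static int parse_thread_count(const char *text)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return 0;
    value = strtol(text, &end, 10);
    if (*end != '\0' || value <= 0 || value > 65536)
        return 0;
    return (int)value;
}

/* Toy model of the inspection order: the MKL value is checked first,
   then the OpenMP value, and only then the default is used. */
static int resolve_thread_count(const char *mkl_value,
                                const char *omp_value,
                                int default_threads)
{
    int n;

    assert(default_threads > 0);
    n = parse_thread_count(mkl_value);
    if (n > 0)
        return n;
    n = parse_thread_count(omp_value);
    if (n > 0)
        return n;
    return default_threads;
}
```

For instance, resolve_thread_count(getenv("MKL_NUM_THREADS"), getenv("OMP_NUM_THREADS"), 4) yields the MKL setting when it is a valid positive integer, otherwise the OpenMP setting, otherwise 4. The real library may further adjust the number it actually uses.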
Threaded Functions and Problems

The following Intel MKL function domains are threaded:
• Direct sparse solver.
• LAPACK. For the list of threaded routines, see Threaded LAPACK Routines.
• Level1 and Level2 BLAS. For the list of threaded routines, see Threaded BLAS Level1 and Level2 Routines.
• All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
• All mathematical VML functions.
• FFT. For the list of FFT transforms that can be threaded, see Threaded FFT Problems.

Threaded LAPACK Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.

The following LAPACK routines are threaded:
• Linear equations, computational routines:
    • Factorization: ?getrf, ?gbtrf, ?potrf, ?pptrf, ?sytrf, ?hetrf, ?sptrf, ?hptrf
    • Solving: ?dttrsb, ?gbtrs, ?gttrs, ?pptrs, ?pbtrs, ?pttrs, ?sytrs, ?sptrs, ?hptrs, ?tptrs, ?tbtrs
• Orthogonal factorization, computational routines: ?geqrf, ?ormqr, ?unmqr, ?ormlq, ?unmlq, ?ormql, ?unmql, ?ormrq, ?unmrq
• Singular Value Decomposition, computational routines: ?gebrd, ?bdsqr
• Symmetric Eigenvalue Problems, computational routines: ?sytrd, ?hetrd, ?sptrd, ?hptrd, ?steqr, ?stedc.
• Generalized Nonsymmetric Eigenvalue Problems, computational routines: chgeqz/zhgeqz.

A number of other LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective use of parallelism: ?gesv, ?posv, ?gels, ?gesvd, ?syev, ?heev, cgegs/zgegs, cgegv/zgegv, cgges/zgges, cggesx/zggesx, cggev/zggev, cggevx/zggevx, and so on.

Threaded BLAS Level1 and Level2 Routines

In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.
The following routines are threaded for Intel® Core™2 Duo and Intel® Core™ i7 processors:
• Level1 BLAS: ?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level2 BLAS: ?gemv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv

Threaded FFT Problems

The following characteristics of a specific problem determine whether your FFT computation may be threaded:
• rank
• domain
• size/length
• precision (single or double)
• placement (in-place or out-of-place)
• strides
• number of transforms
• layout (for example, interleaved or split layout of complex data)

Most FFT problems are threaded. In particular, computation of multiple transforms in one call (number of transforms > 1) is threaded. Details of which transforms are threaded follow.

One-dimensional (1D) transforms

1D transforms are threaded in many cases.

1D complex-to-complex (c2c) transforms of size N using interleaved complex data layout are threaded under the following conditions depending on the architecture:

Architecture  Conditions
Intel® 64     N is a power of 2, log2(N) > 9, the transform is double-precision out-of-place, and input/output strides equal 1.
IA-32         N is a power of 2, log2(N) > 13, and the transform is single-precision.
              N is a power of 2, log2(N) > 14, and the transform is double-precision.
Any           N is composite, log2(N) > 16, and input/output strides equal 1.

1D real-to-complex and complex-to-real transforms are not threaded.

1D complex-to-complex transforms using split-complex layout are not threaded.

Prime-size complex-to-complex 1D transforms are not threaded.

Multidimensional transforms

All multidimensional transforms on large-volume data are threaded.

Avoiding Conflicts in the Execution Environment

Certain situations can cause conflicts in the execution environment that make the use of threads in Intel MKL problematic. This section briefly discusses why these problems exist and how to avoid them.
If you thread the program using OpenMP directives and compile the program with Intel compilers, Intel MKL and the program will both use the same threading library. Intel MKL tries to determine if it is in a parallel region in the program, and if it is, it does not spread its operations over multiple threads unless you specifically request Intel MKL to do so via the MKL_DYNAMIC functionality. However, Intel MKL can be aware that it is in a parallel region only if the threaded program and Intel MKL are using the same threading library. If your program is threaded by some other means, Intel MKL may operate in multithreaded mode, and the performance may suffer due to overuse of the resources.

The following table considers several cases where the conflicts may arise and provides recommendations depending on your threading model:

Threading model: You thread the program using OS threads (pthreads on Mac OS* X).
Discussion: If more than one thread calls Intel MKL, and the function being called is threaded, it may be important that you turn off Intel MKL threading. Set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

Threading model: You thread the program using OpenMP directives and/or pragmas and compile the program using a compiler other than a compiler from Intel.
Discussion: This is more problematic because setting of the OMP_NUM_THREADS environment variable affects both the compiler's threading library and libiomp. In this case, choose the threading library that matches the layered Intel MKL with the OpenMP compiler you employ (see Linking Examples on how to do this). If this is not possible, use Intel MKL in the sequential mode. To do this, you should link with the appropriate threading library: libmkl_sequential.a or libmkl_sequential.dylib (see High-level Directory Structure).

Threading model: There are multiple programs running on a multiple-CPU system, for example, a parallelized program that runs using MPI for communication in which each processor is treated as a node.
Discussion: The threading software will see multiple processors on the system even though each processor has a separate MPI process running on it. In this case, one of the solutions is to set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

See Also
Using Additional Threading Control
Linking with Compiler Run-time Libraries

Techniques to Set the Number of Threads

Use one of the following techniques to change the number of threads to use in Intel MKL:
• Set one of the OpenMP or Intel MKL environment variables:
    • OMP_NUM_THREADS
    • MKL_NUM_THREADS
    • MKL_DOMAIN_NUM_THREADS
• Call one of the OpenMP or Intel MKL functions:
    • omp_set_num_threads()
    • mkl_set_num_threads()
    • mkl_domain_set_num_threads()

When choosing the appropriate technique, take into account the following rules:
• The Intel MKL threading controls take precedence over the OpenMP controls because they are inspected first.
• A function call takes precedence over any environment variables. The exception, which is a consequence of the previous rule, is the OpenMP subroutine omp_set_num_threads(), which does not have precedence over Intel MKL environment variables, such as MKL_NUM_THREADS. See Using Additional Threading Control for more details.
• You cannot change run-time behavior in the course of the run using the environment variables because they are read only once at the first call to Intel MKL.

Setting the Number of Threads Using an OpenMP* Environment Variable

You can set the number of threads using the environment variable OMP_NUM_THREADS. To change the number of threads, in the command shell in which the program is going to run, enter:

    export OMP_NUM_THREADS=<number of threads to use>
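The rule that environment variables are read only once, at the first call to the library, can be illustrated with a small C sketch. This is a hypothetical model, not the way Intel MKL is implemented: the thread_config type and the config_threads function are invented for illustration, and the fallback of one thread for a missing or invalid value is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a library's internal threading state. */
typedef struct {
    int initialized;   /* nonzero once the environment value was consumed */
    int threads;       /* thread count fixed at the first call */
} thread_config;

/* Returns the thread count. env_value is consulted only on the first
   call; on later calls it is ignored, which models "read only once". */
static int config_threads(thread_config *cfg, const char *env_value)
{
    assert(cfg != NULL);
    if (!cfg->initialized) {
        long n = (env_value != NULL) ? strtol(env_value, NULL, 10) : 0;
        /* The fallback of 1 thread is an arbitrary choice for this sketch. */
        cfg->threads = (n > 0 && n <= 65536) ? (int)n : 1;
        cfg->initialized = 1;
    }
    return cfg->threads;
}
```

In a program, env_value would typically be getenv("OMP_NUM_THREADS") evaluated at each call. Changing the variable after the first call has no effect in this model, which is why a function call is needed to change the number of threads at run time.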
See Also
Using Additional Threading Control

Changing the Number of Threads at Run Time

You cannot change the number of threads during run time using environment variables. However, you can call OpenMP API functions from your program to change the number of threads during run time. The following sample code shows how to change the number of threads during run time using the omp_set_num_threads() routine. See also Techniques to Set the Number of Threads.

The following example shows both C and Fortran code examples. To run this example in the C language, use the omp.h header file from the Intel® compiler package. If you do not have the Intel compiler but wish to explore the functionality in the example, use Fortran API for omp_set_num_threads() rather than the C version. For example, omp_set_num_threads_( &i_one );

    // ******* C language *******
    #include "omp.h"
    #include "mkl.h"
    #include <stdlib.h>   /* malloc */
    #define SIZE 1000
    int main(int args, char *argv[]){
        double *a, *b, *c;
        a = (double*)malloc(sizeof(double)*SIZE*SIZE);
        b = (double*)malloc(sizeof(double)*SIZE*SIZE);
        c = (double*)malloc(sizeof(double)*SIZE*SIZE);
        double alpha=1, beta=1;
        int m=SIZE, n=SIZE, k=SIZE, lda=SIZE, ldb=SIZE, ldc=SIZE, i=0, j=0;
        char transa='n', transb='n';
        for( i=0; i ...

    #include ...
    mkl_set_num_threads ( 1 );

    // ******* Fortran language *******
    ...
    call mkl_set_num_threads( 1 )

See the Intel MKL Reference Manual for the detailed description of the threading control functions, their parameters, calling syntax, and more code examples.

MKL_DYNAMIC

The MKL_DYNAMIC environment variable enables Intel MKL to dynamically change the number of threads.

The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.

When MKL_DYNAMIC is TRUE, Intel MKL tries to use what it considers the best number of threads, up to the maximum number you specify.
For example, MKL_DYNAMIC set to TRUE enables optimal choice of the number of threads in the following cases:
• If the requested number of threads exceeds the number of physical cores (perhaps because of using the Intel® Hyper-Threading Technology), and MKL_DYNAMIC is not changed from its default value of TRUE, Intel MKL will scale down the number of threads to the number of physical cores.
• If you are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.

When MKL_DYNAMIC is FALSE, Intel MKL tries not to deviate from the number of threads the user requested. However, setting MKL_DYNAMIC=FALSE does not ensure that Intel MKL will use the number of threads that you request. The library may have no choice on this number for such reasons as system resources.

Additionally, the library may examine the problem and use a different number of threads than the value suggested. For example, if you attempt to do a size one matrix-matrix multiply across eight threads, the library may instead choose to use only one thread because it is impractical to use eight threads in this event.

Note also that if Intel MKL is called in a parallel region, it will use only one thread by default. If you want the library to use nested parallelism, and the thread within a parallel region is compiled with the same OpenMP compiler as Intel MKL is using, you may experiment with setting MKL_DYNAMIC to FALSE and manually increasing the number of threads.

In general, set MKL_DYNAMIC to FALSE only under circumstances that Intel MKL is unable to detect, for example, to use nested parallelism where the library is already called from a parallel section.
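The main MKL_DYNAMIC cases described above can be summarized in a simplified C sketch. This is only a model under stated assumptions, not the library's actual decision logic: the function effective_threads and its parameters are hypothetical, and the model ignores the MPI case, problem-size heuristics, and system resource limits, all of which may lead the real library to choose a different number.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the thread count the library aims for.
   requested          - number of threads the user asked for
   physical_cores     - number of physical cores on the system
   mkl_dynamic        - value of MKL_DYNAMIC (default TRUE)
   in_parallel_region - true if called from inside a parallel region */
static int effective_threads(int requested, int physical_cores,
                             bool mkl_dynamic, bool in_parallel_region)
{
    assert(requested > 0 && physical_cores > 0);

    if (!mkl_dynamic)
        return requested;          /* tries not to deviate from the request */
    if (in_parallel_region)
        return 1;                  /* one thread by default inside a parallel region */
    /* Scale down to the number of physical cores when the request exceeds it. */
    return requested < physical_cores ? requested : physical_cores;
}
```

Under this model, a request for eight threads on a four-core system gives four threads with MKL_DYNAMIC left at TRUE, one thread when the call comes from a parallel region, and eight threads only when MKL_DYNAMIC is FALSE.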
MKL_DOMAIN_NUM_THREADS

The MKL_DOMAIN_NUM_THREADS environment variable suggests the number of threads for a particular function domain.

MKL_DOMAIN_NUM_THREADS accepts a string value <MKL-domain-env-string>, which must have the following format:

    <MKL-domain-env-string> ::= <MKL-domain-env-entry> { <delimiter> <MKL-domain-env-entry> }
    <delimiter> ::= [ <space-symbol>* ] ( <space-symbol> | <comma-symbol> | <semicolon-symbol> | <colon-symbol> ) [ <space-symbol>* ]
    <MKL-domain-env-entry> ::= <MKL-domain-env-name> <uses> <number-of-threads>
    <MKL-domain-env-name> ::= MKL_DOMAIN_ALL | MKL_DOMAIN_BLAS | MKL_DOMAIN_FFT | MKL_DOMAIN_VML | MKL_DOMAIN_PARDISO
    <uses> ::= [ <space-symbol>* ] ( <space-symbol> | <equality-sign> | <comma-symbol> ) [ <space-symbol>* ]
    <number-of-threads> ::= <positive-number>
    <positive-number> ::= <decimal-positive-number> | <octal-number> | <hexadecimal-number>

In the syntax above, values of <MKL-domain-env-name> indicate function domains as follows:

MKL_DOMAIN_ALL      All function domains
MKL_DOMAIN_BLAS     BLAS Routines
MKL_DOMAIN_FFT      Fourier Transform Functions
MKL_DOMAIN_VML      Vector Mathematical Functions
MKL_DOMAIN_PARDISO  PARDISO

For example,

    MKL_DOMAIN_ALL 2 : MKL_DOMAIN_BLAS 1 : MKL_DOMAIN_FFT 4
    MKL_DOMAIN_ALL=2 : MKL_DOMAIN_BLAS=1 : MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL=2, MKL_DOMAIN_BLAS=1, MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL=2; MKL_DOMAIN_BLAS=1; MKL_DOMAIN_FFT=4
    MKL_DOMAIN_ALL = 2 MKL_DOMAIN_BLAS 1 , MKL_DOMAIN_FFT 4
    MKL_DOMAIN_ALL,2: MKL_DOMAIN_BLAS 1, MKL_DOMAIN_FFT,4

The global variables MKL_DOMAIN_ALL, MKL_DOMAIN_BLAS, MKL_DOMAIN_FFT, MKL_DOMAIN_VML, and MKL_DOMAIN_PARDISO, as well as the interface for the Intel MKL threading control functions, can be found in the mkl.h header file.

The table below illustrates how values of MKL_DOMAIN_NUM_THREADS are interpreted.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_ALL=4
Interpretation: All parts of Intel MKL should try four threads. The actual number of threads may be still different because of the MKL_DYNAMIC setting or system resource issues. The setting is equivalent to MKL_NUM_THREADS = 4.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4
Interpretation: All parts of Intel MKL should try one thread, except for BLAS, which is suggested to try four threads.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_VML=2
Interpretation: VML should try two threads. The setting affects no other part of Intel MKL.

Be aware that the domain-specific settings take precedence over the overall ones.
For example, the "MKL_DOMAIN_BLAS=4" value of MKL_DOMAIN_NUM_THREADS suggests trying four threads for BLAS, regardless of later setting MKL_NUM_THREADS, and a function call "mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );" suggests the same, regardless of later calls to mkl_set_num_threads(). However, a function call with input "MKL_DOMAIN_ALL", such as "mkl_domain_set_num_threads (4, MKL_DOMAIN_ALL);" is equivalent to "mkl_set_num_threads(4)", and thus it will be overwritten by later calls to mkl_set_num_threads. Similarly, the environment setting of MKL_DOMAIN_NUM_THREADS with "MKL_DOMAIN_ALL=4" will be overwritten with MKL_NUM_THREADS = 2.

Whereas the MKL_DOMAIN_NUM_THREADS environment variable enables you to set several variables at once, for example, "MKL_DOMAIN_BLAS=4,MKL_DOMAIN_FFT=2", the corresponding function does not take string syntax. So, to do the same with the function calls, you may need to make several calls, which in this example are as follows:

    mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );
    mkl_domain_set_num_threads ( 2, MKL_DOMAIN_FFT );

Setting the Environment Variables for Threading Control

To set the environment variables used for threading control, in the command shell in which the program is going to run, enter:

    export <VARIABLE NAME>=<value>

For example:

    export MKL_NUM_THREADS=4
    export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
    export MKL_DYNAMIC=FALSE

Tips and Techniques to Improve Performance

Coding Techniques

To obtain the best performance with Intel MKL, ensure the following data alignment in your source code:
• Align arrays on 16-byte boundaries. See Aligning Addresses on 16-byte Boundaries for how to do it.
• Make sure leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16, where element_size is the size of an array element in bytes.
• For two-dimensional arrays, avoid leading dimension values divisible by 2048 bytes.
For example, for a double-precision array, with element_size = 8, avoid leading dimensions 256, 512, 768, 1024, … (elements).

LAPACK Packed Routines

The routines with the names that contain the letters HP, OP, PP, SP, TP, UP in the matrix type and storage position (the second and third letters respectively) operate on the matrices in the packed format (see LAPACK "Routine Naming Conventions" sections in the Intel MKL Reference Manual). Their functionality is strictly equivalent to the functionality of the unpacked routines with the names containing the letters HE, OR, PO, SY, TR, UN in the same positions, but the performance is significantly lower.

If the memory restriction is not too tight, use an unpacked routine for better performance. In this case, you need to allocate N^2/2 more memory than the memory required by a respective packed routine, where N is the problem size (the number of equations).

For example, to speed up solving a symmetric eigenproblem with an expert driver, use the unpacked routine:

    call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info)

where a has the dimension lda-by-n, which is at least N^2 elements, instead of the packed routine:

    call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)

where ap has the dimension N*(N+1)/2.

FFT Functions

Additional conditions can improve performance of the FFT functions.

The addresses of the first elements of arrays and the leading dimension values, in bytes (n*element_size), of two-dimensional arrays should be divisible by cache line size, which equals 64 bytes.

Hardware Configuration Tips

Dual-Core Intel® Xeon® processor 5100 series systems

To get the best performance with Intel MKL on Dual-Core Intel® Xeon® processor 5100 series systems, enable the Hardware DPL (streaming data) Prefetcher functionality of this processor.
To configure this functionality, use the appropriate BIOS settings, as described in your BIOS documentation.

Intel® Hyper-Threading Technology

Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.

If you run with Intel HT Technology enabled, performance may be especially impacted if you run on fewer threads than physical cores. Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation. For Intel MKL, apply the following setting:

    set KMP_AFFINITY=granularity=fine,compact,1,0

See Also
Using Parallelism of the Intel® Math Kernel Library

Operating on Denormals

The IEEE 754-2008 standard, "An IEEE Standard for Binary Floating-Point Arithmetic", defines denormal (or subnormal) numbers as non-zero numbers smaller than the smallest possible normalized numbers for a specific floating-point format. Floating-point operations on denormals are slower than on normalized operands because denormal operands and results are usually handled through a software assist mechanism rather than directly in hardware. This software processing causes Intel MKL functions that consume denormals to run slower than with normalized floating-point numbers.
You can mitigate this performance issue by setting the appropriate bit fields in the MXCSR floating-point control register to flush denormals to zero (FTZ) or to replace any denormals loaded from memory with zero (DAZ). Check your compiler documentation to determine whether it has options to control FTZ and DAZ. Note that these compiler options may slightly affect accuracy.

FFT Optimized Radices

You can improve the performance of Intel MKL FFT if the length of your data vector permits factorization into powers of optimized radices.

In Intel MKL, the optimized radices are 2, 3, 5, 7, 11, and 13.

Using Memory Management

Intel MKL Memory Management Software

Intel MKL has memory management software that controls memory buffers for the use by the library functions. New buffers that the library allocates when your application calls Intel MKL are not deallocated until the program ends. To get the amount of memory allocated by the memory management software, call the mkl_mem_stat() function. If your program needs to free memory, call mkl_free_buffers(). If another call is made to a library function that needs a memory buffer, the memory manager again allocates the buffers and they again remain allocated until either the program ends or the program deallocates the memory. This behavior facilitates better performance. However, some tools may report this behavior as a memory leak.

The memory management software is turned on by default. To turn it off, set the MKL_DISABLE_FAST_MM environment variable to any value or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL routines, especially for small problem sizes.

Redefining Memory Functions

In C/C++ programs, you can replace Intel MKL memory functions that the library uses by default with your own functions. To do this, use the memory renaming feature.
Memory Renaming

Intel MKL memory management by default uses standard C run-time memory functions to allocate or free memory. These functions can be replaced using memory renaming.

Intel MKL accesses the memory functions by pointers i_malloc, i_free, i_calloc, and i_realloc, which are visible at the application level. These pointers initially hold addresses of the standard C run-time memory functions malloc, free, calloc, and realloc, respectively. You can programmatically redefine values of these pointers to the addresses of your application's memory management functions.

Redirecting the pointers is the only correct way to use your own set of memory management functions. If you call your own memory functions without redirecting the pointers, the memory will get managed by two independent memory management packages, which may cause unexpected memory issues.

How to Redefine Memory Functions

To redefine memory functions, use the following procedure:

1. Include the i_malloc.h header file in your code. This header file contains all declarations required for replacing the memory allocation functions. The header file also describes how memory allocation can be replaced in those Intel libraries that support this feature.
2. Redefine values of pointers i_malloc, i_free, i_calloc, and i_realloc prior to the first call to MKL functions, as shown in the following example:

    #include "i_malloc.h"
    . . .
    i_malloc = my_malloc;
    i_calloc = my_calloc;
    i_realloc = my_realloc;
    i_free = my_free;
    . . .
    // Now you may call Intel MKL functions

Language-specific Usage Options

The Intel® Math Kernel Library (Intel® MKL) provides broad support for Fortran and C/C++ programming. However, not all functions support both Fortran and C interfaces. For example, some LAPACK functions have no C interface. You can call such functions from C using mixed-language programming.
If you want to use LAPACK or BLAS functions that support Fortran 77 in the Fortran 95 environment, additional effort may be initially required to build compiler-specific interface libraries and modules from the source code provided with Intel MKL.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Using Language-Specific Interfaces with Intel® Math Kernel Library

This section discusses mixed-language programming and the use of language-specific interfaces with Intel MKL.

See also Appendix G in the Intel MKL Reference Manual for details of the FFTW interfaces to Intel MKL.

Interface Libraries and Modules

You can create the following interface libraries and modules using the respective makefiles located in the interfaces directory.

File name                    Contains

Libraries, in Intel MKL architecture-specific directories
libmkl_blas95.a (1)          Fortran 95 wrappers for BLAS (BLAS95) for IA-32 architecture.
libmkl_blas95_ilp64.a (1)    Fortran 95 wrappers for BLAS (BLAS95) supporting ILP64 interface.
libmkl_blas95_lp64.a (1)     Fortran 95 wrappers for BLAS (BLAS95) supporting LP64 interface.
libmkl_lapack95.a (1)        Fortran 95 wrappers for LAPACK (LAPACK95) for IA-32 architecture.
libmkl_lapack95_lp64.a (1)   Fortran 95 wrappers for LAPACK (LAPACK95) supporting LP64 interface.
libmkl_lapack95_ilp64.a (1)  Fortran 95 wrappers for LAPACK (LAPACK95) supporting ILP64 interface.
libfftw2xc_intel.a (1)       Interfaces for FFTW version 2.x (C interface for Intel compilers) to call Intel MKL FFTs.
libfftw2xc_gnu.a             Interfaces for FFTW version 2.x (C interface for GNU compilers) to call Intel MKL FFTs.
libfftw2xf_intel.a           Interfaces for FFTW version 2.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
libfftw2xf_gnu.a             Interfaces for FFTW version 2.x (Fortran interface for GNU compilers) to call Intel MKL FFTs.
libfftw3xc_intel.a (2)       Interfaces for FFTW version 3.x (C interface for Intel compilers) to call Intel MKL FFTs.
libfftw3xc_gnu.a             Interfaces for FFTW version 3.x (C interface for GNU compilers) to call Intel MKL FFTs.
libfftw3xf_intel.a (2)       Interfaces for FFTW version 3.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
libfftw3xf_gnu.a             Interfaces for FFTW version 3.x (Fortran interface for GNU compilers) to call Intel MKL FFTs.

Modules, in architecture- and interface-specific subdirectories of the Intel MKL include directory
blas95.mod (1)               Fortran 95 interface module for BLAS (BLAS95).
lapack95.mod (1)             Fortran 95 interface module for LAPACK (LAPACK95).
f95_precision.mod (1)        Fortran 95 definition of precision parameters for BLAS95 and LAPACK95.
mkl95_blas.mod (1)           Fortran 95 interface module for BLAS (BLAS95), identical to blas95.mod. To be removed in one of the future releases.
mkl95_lapack.mod (1)         Fortran 95 interface module for LAPACK (LAPACK95), identical to lapack95.mod. To be removed in one of the future releases.
mkl95_precision.mod (1)      Fortran 95 definition of precision parameters for BLAS95 and LAPACK95, identical to f95_precision.mod. To be removed in one of the future releases.
mkl_service.mod (1)          Fortran 95 interface module for Intel MKL support functions.

(1) Prebuilt for the Intel® Fortran compiler
(2) FFTW3 interfaces are integrated with Intel MKL.
Look into /interfaces/fftw3x*/ makefile for options defining how to build and where to place the standalone library with the wrappers.

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 interfaces are compiler-dependent. Intel MKL provides the interface libraries and modules precompiled with the Intel® Fortran compiler. Additionally, the Fortran 95 interfaces and wrappers are delivered as sources. (For more information, see Compiler-dependent Functions and Fortran 90 Modules). If you are using a different compiler, build the appropriate library and modules with your compiler and link the library as a user's library:

1. Go to the respective directory /interfaces/blas95 or /interfaces/lapack95
2. Type one of the following commands depending on your architecture:
   • For the IA-32 architecture,
         make libia32 INSTALL_DIR=
   • For the Intel® 64 architecture,
         make libintel64 [interface=lp64|ilp64] INSTALL_DIR=

Important The parameter INSTALL_DIR is required.

As a result, the required library is built and installed in the /lib directory, and the .mod files are built and installed in the /include/[/{lp64|ilp64}] directory, where the architecture subdirectory is one of {ia32, intel64}.

By default, the ifort compiler is assumed. You may change the compiler with the additional make parameter FC=.

For example, the command

    make libintel64 FC=pgf95 INSTALL_DIR= interface=lp64

builds the required library and .mod files and installs them in subdirectories of the directory given by INSTALL_DIR.

To delete the library from the building directory, use one of the following commands:
• For the IA-32 architecture,
      make cleania32 INSTALL_DIR=
• For the Intel® 64 architecture,
      make cleanintel64 [interface=lp64|ilp64] INSTALL_DIR=
• For all the architectures,
      make clean INSTALL_DIR=

CAUTION Even if you have administrative rights, avoid setting INSTALL_DIR=../.. or setting INSTALL_DIR to the Intel MKL installation directory in a build or clean command above, because these settings replace or delete the Intel MKL prebuilt Fortran 95 library and modules.

Compiler-dependent Functions and Fortran 90 Modules

Compiler-dependent functions occur whenever the compiler inserts into the object code function calls that are resolved in its run-time library (RTL). Linking such code without the appropriate RTL results in undefined symbols. Intel MKL has been designed to minimize RTL dependencies.

In cases where RTL dependencies might arise, the functions are delivered as source code and you need to compile the code with whatever compiler you are using for your application.

In particular, Fortran 90 modules result in compiler-specific code generation that requires RTL support. Therefore, Intel MKL delivers these modules compiled with the Intel compiler, along with source code, to be used with different compilers.

Mixed-language Programming with the Intel Math Kernel Library

Appendix A: Intel(R) Math Kernel Library Language Interfaces Support lists the programming languages supported for each Intel MKL function domain. However, you can call Intel MKL routines from different language environments.

Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments

Not all Intel MKL function domains support both C and Fortran environments. To use Intel MKL Fortran-style functions in C/C++ environments, you should observe certain conventions, which are discussed for LAPACK and BLAS in the subsections below.

CAUTION Avoid calling BLAS 95/LAPACK 95 from C/C++. Such calls require skills in manipulating the descriptor of a deferred-shape array, which is the Fortran 90 type. Moreover, BLAS95/LAPACK95 routines contain links to a Fortran RTL.

LAPACK and BLAS

Because LAPACK and BLAS routines are Fortran-style, when calling them from C-language programs, follow the Fortran-style calling conventions:
• Pass variables by address, not by value.
  Function calls in Example "Calling a Complex BLAS Level 1 Function from C++" and Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrate this.
• Store your data in Fortran style, that is, column-major rather than row-major order.

With row-major order, adopted in C, the last array index changes most quickly and the first one changes most slowly when traversing the memory segment where the array is stored. With Fortran-style column-major order, the last index changes most slowly whereas the first index changes most quickly (as illustrated by the figure below for a two-dimensional array).

For example, if a two-dimensional matrix A of size m x n is stored densely in a one-dimensional array B, you can access a matrix element like this:

    A[i][j] = B[i*n+j]      in C        ( i=0, ... , m-1, j=0, ... , n-1 )
    A(i,j)  = B((j-1)*m+i)  in Fortran  ( i=1, ... , m, j=1, ... , n )

When calling LAPACK or BLAS routines from C, be aware that because the Fortran language is case-insensitive, the routine names can be either upper-case or lower-case, with or without the trailing underscore. For example, the following names are equivalent:
• LAPACK: dgetrf, DGETRF, dgetrf_, and DGETRF_
• BLAS: dgemm, DGEMM, dgemm_, and DGEMM_

See Example "Calling a Complex BLAS Level 1 Function from C++" on how to call BLAS routines from C.

See also the Intel(R) MKL Reference Manual for a description of the C interface to LAPACK functions.

CBLAS

Instead of calling BLAS routines from a C-language program, you can use the CBLAS interface.

CBLAS is a C-style interface to the BLAS routines. You can call CBLAS routines using regular C-style calls. Use the mkl.h header file with the CBLAS interface. The header file specifies enumerated values and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation.
Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrates the use of the CBLAS interface.

C Interface to LAPACK

Instead of calling LAPACK routines from a C-language program, you can use the C interface to LAPACK provided by Intel MKL.

The C interface to LAPACK is a C-style interface to the LAPACK routines. This interface supports matrices in row-major and column-major order, which you can define in the first function argument matrix_order. Use the mkl_lapacke.h header file with the C interface to LAPACK. The header file specifies constants and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. You can find examples of the C interface to LAPACK in the examples/lapacke subdirectory in the Intel MKL installation directory.

Using Complex Types in C/C++

As described in the documentation for the Intel® Fortran Compiler XE, C/C++ does not directly implement the Fortran types COMPLEX(4) and COMPLEX(8). However, you can write equivalent structures. The type COMPLEX(4) consists of two 4-byte floating-point numbers. The first of them is the real-number component, and the second one is the imaginary-number component. The type COMPLEX(8) is similar to COMPLEX(4) except that it contains two 8-byte floating-point numbers.

Intel MKL provides complex types MKL_Complex8 and MKL_Complex16, which are structures equivalent to the Fortran complex types COMPLEX(4) and COMPLEX(8), respectively. The MKL_Complex8 and MKL_Complex16 types are defined in the mkl_types.h header file. You can use these types to define complex data. You can also redefine the types with your own types before including the mkl_types.h header file. The only requirement is that the types must be compatible with the Fortran complex layout, that is, the complex type must be a pair of real numbers for the values of real and imaginary parts.
For example, you can use the following definitions in your C++ code:

    #define MKL_Complex8 std::complex<float>

and

    #define MKL_Complex16 std::complex<double>

See Example "Calling a Complex BLAS Level 1 Function from C++" for details. You can also define these types in the command line:

    -DMKL_Complex8="std::complex<float>"
    -DMKL_Complex16="std::complex<double>"

See Also
Intel® Software Documentation Library

Calling BLAS Functions that Return the Complex Values in C/C++ Code

Complex values that functions return are handled differently in C and Fortran. Because BLAS is Fortran-style, you need to be careful when handling a call from C to a BLAS function that returns complex values. However, in addition to normal function calls, Fortran enables calling functions as though they were subroutines, which provides a mechanism for returning the complex value correctly when the function is called from a C program. When a Fortran function is called as a subroutine, the return value is the first parameter in the calling sequence. You can use this feature to call a BLAS function from C.

The following example shows how a call to a Fortran function as a subroutine converts to a call from C and the hidden parameter result gets exposed:

Normal Fortran function call:

    result = cdotc( n, x, 1, y, 1 )

A call to the function as a subroutine:

    call cdotc( result, n, x, 1, y, 1)

A call to the function from C:

    cdotc( &result, &n, x, &one, y, &one )

NOTE Intel MKL has both upper-case and lower-case entry points in the Fortran-style (case-insensitive) BLAS, with or without the trailing underscore. So, all these names are equivalent and acceptable: cdotc, CDOTC, cdotc_, and CDOTC_.

The above example shows one of the ways to call several level 1 BLAS functions that return complex values from your C and C++ applications. An easier way is to use the CBLAS interface.
For instance, you can call the same function using the CBLAS interface as follows:

    cblas_cdotc_sub( n, x, 1, y, 1, &result )

NOTE The complex value comes last on the argument list in this case.

The following examples show use of the Fortran-style BLAS interface from C and C++, as well as the CBLAS (C language) interface:
• Example "Calling a Complex BLAS Level 1 Function from C"
• Example "Calling a Complex BLAS Level 1 Function from C++"
• Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

Example "Calling a Complex BLAS Level 1 Function from C"

The example below illustrates a call from a C program to the complex BLAS Level 1 function zdotc(). This function computes the dot product of two double-precision complex vectors.

In this example, the complex dot product is returned in the structure c.

    #include <stdio.h>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n = N, inca = 1, incb = 1, i;
        MKL_Complex16 a[N], b[N], c;
        for( i = 0; i < n; i++ ){
            a[i].real = (double)i; a[i].imag = (double)i * 2.0;
            b[i].real = (double)(n - i); b[i].imag = (double)i * 2.0;
        }
        /* Fortran-style call: result first, scalars passed by address */
        zdotc( &c, &n, a, &inca, b, &incb );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.real, c.imag );
        return 0;
    }

Example "Calling a Complex BLAS Level 1 Function from C++"

Below is the C++ implementation:

    #include <complex>
    #include <iostream>
    #define MKL_Complex16 std::complex<double>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        std::complex<double> a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i] = std::complex<double>(i,i*2.0);
            b[i] = std::complex<double>(n-i,i*2.0);
        }
        zdotc(&c, &n, a, &inca, b, &incb );
        std::cout << "The complex dot product is: " << c << std::endl;
        return 0;
    }

Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

This example uses CBLAS:

    #include <stdio.h>
    #include "mkl.h"
    typedef struct{ double re; double im; } complex16;
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        complex16 a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i].re = (double)i; a[i].im = (double)i * 2.0;
            b[i].re = (double)(n - i); b[i].im = (double)i * 2.0;
        }
        /* C-style call: scalars by value, result last */
        cblas_zdotc_sub(n, a, inca, b, incb, &c );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im );
        return 0;
    }

Support for Boost uBLAS Matrix-matrix Multiplication

If you are used to uBLAS, you can perform BLAS matrix-matrix multiplication in C++ using Intel MKL substitution of Boost uBLAS functions. uBLAS is the Boost C++ open-source library that provides BLAS functionality for dense, packed, and sparse matrices. The library uses an expression template technique for passing expressions as function arguments, which enables evaluating vector and matrix expressions in one pass without temporary matrices. uBLAS provides two modes:
• Debug (safe) mode, default. Checks types and conformance.
• Release (fast) mode. Does not check types and conformance. To enable this mode, use the NDEBUG preprocessor symbol.

The documentation for the Boost uBLAS is available at www.boost.org.

Intel MKL provides overloaded prod() functions for substituting uBLAS dense matrix-matrix multiplication with the Intel MKL gemm calls. Though these functions break uBLAS expression templates and introduce temporary matrices, the performance advantage can be considerable for matrix sizes that are not too small (roughly, over 50).

You do not need to change your source code to use the functions. To call them:
• Include the header file mkl_boost_ublas_matrix_prod.hpp in your code (from the Intel MKL include directory)
• Add appropriate Intel MKL libraries to the link line.
The list of expressions that are substituted follows:

    prod( m1, m2 )
    prod( trans(m1), m2 )
    prod( trans(conj(m1)), m2 )
    prod( conj(trans(m1)), m2 )
    prod( m1, trans(m2) )
    prod( trans(m1), trans(m2) )
    prod( trans(conj(m1)), trans(m2) )
    prod( conj(trans(m1)), trans(m2) )
    prod( m1, trans(conj(m2)) )
    prod( trans(m1), trans(conj(m2)) )
    prod( trans(conj(m1)), trans(conj(m2)) )
    prod( conj(trans(m1)), trans(conj(m2)) )
    prod( m1, conj(trans(m2)) )
    prod( trans(m1), conj(trans(m2)) )
    prod( trans(conj(m1)), conj(trans(m2)) )
    prod( conj(trans(m1)), conj(trans(m2)) )

These expressions are substituted in the release mode only (with NDEBUG preprocessor symbol defined). Supported uBLAS versions are Boost 1.34.1 and higher. To get them, visit www.boost.org.

A code example provided in the /examples/ublas/source/sylvester.cpp file illustrates usage of the Intel MKL uBLAS header file for solving a special case of the Sylvester equation.

To run the Intel MKL ublas examples, specify the BOOST_ROOT parameter in the make command, for instance, when using Boost version 1.37.0:

    make libia32 BOOST_ROOT = /boost_1_37_0

See Also
Using Code Examples

Invoking Intel MKL Functions from Java* Applications

Intel MKL Java* Examples

To demonstrate binding with Java, Intel MKL includes a set of Java examples in the following directory:

    /examples/java

The examples are provided for the following MKL functions:
• ?gemm, ?gemv, and ?dot families from CBLAS
• The complete set of FFT functions
• ESSL(1)-like functions for one-dimensional convolution and correlation
• VSL Random Number Generators (RNG), except user-defined ones and file subroutines
• VML functions, except GetErrorCallBack, SetErrorCallBack, and ClearErrorCallBack

You can see the example sources in the following directory:

    /examples/java/examples

The examples are written in Java.
They demonstrate usage of the MKL functions with the following variety of data:
• 1- and 2-dimensional data sequences
• Real and complex types of the data
• Single and double precision

However, the wrappers, used in the examples, do not:
• Demonstrate the use of large arrays (>2 billion elements)
• Demonstrate processing of arrays in native memory
• Check correctness of function parameters
• Demonstrate performance optimizations

The examples use the Java Native Interface (JNI* developer framework) to bind with Intel MKL. The JNI documentation is available from http://java.sun.com/javase/6/docs/technotes/guides/jni/.

The Java example set includes JNI wrappers that perform the binding. The wrappers do not depend on the examples and may be used in your Java applications. The wrappers for CBLAS, FFT, VML, VSL RNG, and ESSL-like convolution and correlation functions do not depend on each other.

To build the wrappers, just run the examples. The makefile builds the wrapper binaries. After running the makefile, you can run the examples, which will determine whether the wrappers were built correctly. As a result of running the examples, the following directories will be created in /examples/java:
• docs
• include
• classes
• bin
• _results

The directories docs, include, classes, and bin will contain the wrapper binaries and documentation; the directory _results will contain the testing results.

For a Java programmer, the wrappers are the following Java classes:
• com.intel.mkl.CBLAS
• com.intel.mkl.DFTI
• com.intel.mkl.ESSL
• com.intel.mkl.VML
• com.intel.mkl.VSL

Documentation for the particular wrapper and example classes will be generated from the Java sources while building and running the examples. To browse the documentation, open the index file in the docs directory (created by the build script):

    /examples/java/docs/index.html
The Java wrappers for CBLAS, VML, VSL RNG, and FFT establish the interface that directly corresponds to the underlying native functions, so you can refer to the Intel MKL Reference Manual for their functionality and parameters. Interfaces for the ESSL-like functions are described in the generated documentation for the com.intel.mkl.ESSL class.

Each wrapper consists of the interface part for Java and JNI stub written in C. You can find the sources in the following directory:

    /examples/java/wrappers

Both Java and C parts of the wrapper for CBLAS and VML demonstrate the straightforward approach, which you may use to cover additional CBLAS functions.

The wrapper for FFT is more complicated because it needs to support the lifecycle for FFT descriptor objects. To compute a single Fourier transform, an application needs to call the FFT software several times with the same copy of the native FFT descriptor. The wrapper provides the handler class to hold the native descriptor, while the virtual machine runs Java bytecode.

The wrapper for VSL RNG is similar to the one for FFT. The wrapper provides the handler class to hold the native descriptor of the stream state.

The wrapper for the convolution and correlation functions mitigates the same difficulty of the VSL interface, which assumes a similar lifecycle for "task descriptors". The wrapper utilizes the ESSL-like interface for those functions, which is simpler for the case of 1-dimensional data. The JNI stub additionally encapsulates the MKL functions into the ESSL-like wrappers written in C and so "packs" the lifecycle of a task descriptor into a single call to the native method.

The wrappers meet the JNI Specification versions 1.1 and 5.0 and should work with virtually every modern implementation of Java.
The examples and the Java part of the wrappers are written for the Java language described in "The Java Language Specification (First Edition)" and extended with the feature of "inner classes" (that is, the language as of the late 1990s). This level of language version is supported by all versions of the Sun Java Development Kit* (JDK*) developer toolkit and compatible implementations starting from version 1.1.5, or by all modern versions of Java.

The level of C language is "Standard C" (that is, C89) with additional assumptions about integer and floating-point data types required by the Intel MKL interfaces and the JNI header files. That is, the native float and double data types must be the same as JNI jfloat and jdouble data types, respectively, and the native int must be 4 bytes long.

(1) IBM Engineering Scientific Subroutine Library (ESSL*).

See Also
Running the Java* Examples

Running the Java* Examples

The Java examples support all the C and C++ compilers that Intel MKL does. The makefile intended to run the examples also needs the make utility, which is typically provided with the Mac OS* X distribution.

To run Java examples, the JDK* developer toolkit is required for compiling and running Java code. A Java implementation must be installed on the computer or available via the network. You may download the JDK from the vendor website.

The examples should work for all versions of JDK. However, they were tested only with the following Java implementation for all the supported architectures:
• J2SE* SDK 1.4.2 and JDK 5.0 from Apple Computer, Inc. (http://apple.com/).

Note that the Java run-time environment* (JRE*) system, which may be pre-installed on your computer, is not enough.
You need the JDK* developer toolkit that supports the following set of tools:
• java
• javac
• javah
• javadoc

To make these tools available for the examples makefile, set the JAVA_HOME environment variable and add the JDK binaries directory to the system PATH, for example:

    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home
    export PATH=${JAVA_HOME}/bin:${PATH}

You may also need to clear the JDK_HOME environment variable, if it is assigned a value:

    unset JDK_HOME

To start the examples, use the makefile found in the Intel MKL Java examples directory:

    make {dylibia32|libia32} [function=...] [compiler=...]

If you type the make command and omit the target (for example, dylibia32), the makefile prints the help info, which explains the targets and parameters.

For the examples list, see the examples.lst file in the Java examples directory.

Known Limitations of the Java* Examples

This section explains limitations of Java examples.

Functionality

Some Intel MKL functions may fail to work if called from the Java environment by using a wrapper, like those provided with the Intel MKL Java examples. Only those specific CBLAS, FFT, VML, VSL RNG, and the convolution/correlation functions listed in the Intel MKL Java Examples section were tested with the Java environment. So, you may use the Java wrappers for these CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions in your Java applications.

Performance

The Intel MKL functions are expected to run faster than similar functions written in pure Java. However, the main goal of these wrappers is to provide code examples, not maximum performance. So, an Intel MKL function called from a Java application will probably run slower than the same function called from a program written in C/C++ or Fortran.
Known bugs

There are a number of known bugs in Intel MKL (identified in the Release Notes), as well as incompatibilities between different versions of JDK. The examples and wrappers include workarounds for these problems. Look at the source code in the examples and wrappers for comments that describe the workarounds.

Coding Tips

This section discusses programming with the Intel® Math Kernel Library (Intel® MKL) to provide coding tips that meet certain, specific needs, such as consistent results of computations or conditional compilation.

Aligning Data for Consistent Results

Routines in Intel MKL may return different results from run-to-run on the same system. This is usually due to a change in the order in which floating-point operations are performed. The two most influential factors are array alignment and parallelism. Array alignment can determine how internal loops order floating-point operations. Non-deterministic parallelism may change the order in which computational tasks are executed. While these results may differ, they should still fall within acceptable computational error bounds. To better assure identical results from run-to-run, do the following:
• Align input arrays on 16-byte boundaries
• Run Intel MKL in the sequential mode

To align input arrays on 16-byte boundaries, use mkl_malloc() in place of system provided memory allocators, as shown in the code example below. Sequential mode of Intel MKL removes the influence of non-deterministic parallelism.

Aligning Addresses on 16-byte Boundaries

    // ******* C language *******
    ...
    #include <mkl.h>   // declares mkl_malloc() and mkl_free()
    ...
    void *darray;
    int workspace;
    ...
    // Allocate workspace aligned on 16-byte boundary
    darray = mkl_malloc( sizeof(double)*workspace, 16 );
    ...
    // call the program using MKL
    mkl_app( darray );
    ...
    // Free workspace
    mkl_free( darray );

    ! ******* Fortran language *******
    ...
    double precision darray
    pointer (p_wrk,darray(1))
    integer workspace
    ...
    ! Allocate workspace aligned on 16-byte boundary
    p_wrk = mkl_malloc( 8*workspace, 16 )
    ...
    ! call the program using MKL
    call mkl_app( darray )
    ...
    ! Free workspace
    call mkl_free(p_wrk)

Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Preprocessor symbols (macros) substitute values in a program before it is compiled. The substitution is performed in the preprocessing phase.

The following preprocessor symbols are available:

Predefined Preprocessor Symbol    Description
__INTEL_MKL__                     Intel MKL major version
__INTEL_MKL_MINOR__               Intel MKL minor version
__INTEL_MKL_UPDATE__              Intel MKL update number
INTEL_MKL_VERSION                 Intel MKL full version in the following format:
                                  INTEL_MKL_VERSION = (__INTEL_MKL__*100+__INTEL_MKL_MINOR__)*100+__INTEL_MKL_UPDATE__

These symbols enable conditional compilation of code that uses new features introduced in a particular version of the library.

To perform conditional compilation:

1. Include in your code the file where the macros are defined:
   • mkl.h for C/C++
   • mkl.fi for Fortran
2. [Optionally] Use the following preprocessor directives to check whether the macro is defined:
   • #ifdef, #endif for C/C++
   • !DEC$IF DEFINED, !DEC$ENDIF for Fortran
3. Use preprocessor directives for conditional inclusion of code:
   • #if, #endif for C/C++
   • !DEC$IF, !DEC$ENDIF for Fortran

Example

Compile a part of the code if Intel MKL version is MKL 10.3 update 4, for which INTEL_MKL_VERSION = (10*100+3)*100+4 = 100304:

C/C++:

    #include "mkl.h"
    #ifdef INTEL_MKL_VERSION
    #if INTEL_MKL_VERSION == 100304
    // Code to be conditionally compiled
    #endif
    #endif

Fortran:

    include "mkl.fi"
    !DEC$IF DEFINED INTEL_MKL_VERSION
    !DEC$IF INTEL_MKL_VERSION .EQ. 100304
    * Code to be conditionally compiled
    !DEC$ENDIF
    !DEC$ENDIF

Configuring Your Integrated Development Environment to Link with Intel Math Kernel Library

Configuring the Apple Xcode* Developer Software to Link with Intel® Math Kernel Library

This section provides information on linking Intel MKL with the Apple Xcode* developer software. Please note that the screen shots are from Apple Xcode* 2.4 and may be different in other versions, whereas the fundamental steps to configuring Xcode* for use with Intel MKL are more widely applicable:

1. Open your project that uses Intel MKL.
2. Under Targets, double-click the active target. In the Target dialog box, assign values to the build settings as explained in the next steps.
3. Click the plus icon under the Build Settings table, located at the bottom of the dialog box, to add a row. In the new row, type HEADER_SEARCH_PATHS under Name and the path to the Intel® MKL include files, that is, /include, under Value.
4. Click the plus icon under the Build Settings table to add another row, in which type LIBRARY_SEARCH_PATHS under Name and the path to the Intel MKL libraries, such as /lib, under Value.
5. Double-click OTHER_LDFLAGS under Name and under Value, type linker options for additional libraries (for example, -lmkl_core -lguide -lpthread).
6. (Optional, needed only for dynamic linking) Under Executables, double-click the active executable, click the Arguments tab, and under Variables to be set in the environment, add DYLD_LIBRARY_PATH with the value of /lib.

See Also
Notational Conventions
Linking in Detail

Intel® Optimized LINPACK Benchmark for Mac OS* X

Intel® Optimized LINPACK Benchmark is a generalization of the LINPACK 1000 benchmark.
It solves a dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy. The generalization is in the number of equations (N) it can solve, which is not limited to 1000. It uses partial pivoting to assure the accuracy of the results.

Do not use this benchmark to report LINPACK 100 performance because that is a compiled-code only benchmark. This is a shared-memory (SMP) implementation which runs on a single platform. Do not confuse this benchmark with LINPACK, the library, which has been expanded upon by the LAPACK library.

Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your genuine Intel processor systems more easily than with the High Performance Linpack (HPL) benchmark. Use this package to benchmark your SMP machine.

Additional information on this software as well as other Intel software performance products is available at http://www.intel.com/software/products/.

Contents of the Intel® Optimized LINPACK Benchmark

The Intel Optimized LINPACK Benchmark for Mac OS* X contains the following files, located in the ./benchmarks/linpack/ subdirectory of the Intel® Math Kernel Library (Intel® MKL) directory:

File in ./benchmarks/linpack/    Description
linpack_cd32.app                 The 32-bit program executable for a system using Intel® Core™ Duo processor on Mac OS* X.
linpack_cd64.app                 The 64-bit program executable for a system using Intel® Core™ microarchitecture on Mac OS* X.
runme32                          A sample shell script for executing a pre-determined problem set for linpack_cd32.app. OMP_NUM_THREADS set to 2 cores.
runme64                          A sample shell script for executing a pre-determined problem set for linpack_cd64.app. OMP_NUM_THREADS set to 2 cores.
lininput                         Input file for pre-determined problem for the runme32 script.
lin_cd32.txt                     Result of the runme32 script execution.
lin_cd64.txt                     Result of the runme64 script execution.
help.lpk                         Simple help file.
xhelp.lpk                        Extended help file.

See Also
High-level Directory Structure

Running the Software

To obtain results for the pre-determined sample problem sizes on a given system, type one of the following, as appropriate:

    ./runme32
    ./runme64

To run the software for other problem sizes, see the extended help included with the program. Extended help can be viewed by running the program executable with the -e option:

    ./linpack_cd32.app -e
    ./linpack_cd64.app -e

The pre-defined data input file lininput is provided merely as an example. Different systems have different amounts of memory and thus require new input files. The extended help can be used for insight into proper ways to change the sample input files.

lininput requires at least 2 GB of memory.

If the system has less memory than the above sample data input requires, you may need to edit or create your own data input files, as explained in the extended help.

Each sample script uses the OMP_NUM_THREADS environment variable to set the number of processors it is targeting. To optimize performance on a different number of physical processors, change that line appropriately. If you run the Intel Optimized LINPACK Benchmark without setting the number of threads, it will default to the number of cores according to the OS. You can find the settings for this environment variable in the runme* sample scripts. If the settings do not yet match the situation for your machine, edit the script.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Known Limitations of the Intel® Optimized LINPACK Benchmark The following limitations are known for the Intel Optimized LINPACK Benchmark for Mac OS* X: • If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file. • The binary will hang if it is not given an input file or any other arguments. 9 Intel® Math Kernel Library for Mac OS* X User's Guide 68Intel® Math Kernel Library Language Interfaces Support A Language Interfaces Support, by Function Domain The following table shows language interfaces that Intel® Math Kernel Library (Intel® MKL) provides for each function domain. However, Intel MKL routines can be called from other languages using mixed-language programming. See Mixed-language Programming with Intel® MKL for an example of how to call Fortran routines from C/C++. 
Function Domain | FORTRAN 77 interface | Fortran 90/95 interface | C/C++ interface
Basic Linear Algebra Subprograms (BLAS) | Yes | Yes | via CBLAS
BLAS-like extension transposition routines | Yes | - | Yes
Sparse BLAS Level 1 | Yes | Yes | via CBLAS
Sparse BLAS Level 2 and 3 | Yes | Yes | Yes
LAPACK routines for solving systems of linear equations | Yes | Yes | Yes
LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations | Yes | Yes | Yes
Auxiliary and utility LAPACK routines | Yes | - | Yes
DSS/PARDISO* solvers | Yes | Yes | Yes
Other Direct and Iterative Sparse Solver routines | Yes | Yes | Yes
Vector Mathematical Library (VML) functions | Yes | Yes | Yes
Vector Statistical Library (VSL) functions | Yes | Yes | Yes
Fourier Transform functions (FFT) | - | Yes | Yes
Trigonometric Transform routines | - | Yes | Yes
Fast Poisson, Laplace, and Helmholtz Solver (Poisson Library) routines | - | Yes | Yes
Optimization (Trust-Region) Solver routines | Yes | Yes | Yes
Data Fitting functions | Yes | Yes | Yes
GMP* arithmetic functions †† | - | - | Yes
Support functions (including memory allocation) | Yes | Yes | Yes

†† GMP Arithmetic Functions are deprecated and will be removed in a future release.
Include Files

Function domain | Fortran Include Files | C/C++ Include Files
All function domains | mkl.fi | mkl.h
BLAS Routines | blas.f90, mkl_blas.fi | mkl_blas.h
BLAS-like Extension Transposition Routines | mkl_trans.fi | mkl_trans.h
CBLAS Interface to BLAS | - | mkl_cblas.h
Sparse BLAS Routines | mkl_spblas.fi | mkl_spblas.h
LAPACK Routines | lapack.f90, mkl_lapack.fi | mkl_lapack.h
C Interface to LAPACK | - | mkl_lapacke.h
All Sparse Solver Routines | mkl_solver.f90 | mkl_solver.h
  PARDISO | mkl_pardiso.f77, mkl_pardiso.f90 | mkl_pardiso.h
  DSS Interface | mkl_dss.f77, mkl_dss.f90 | mkl_dss.h
  RCI Iterative Solvers, ILU Factorization | mkl_rci.fi | mkl_rci.h
Optimization Solver Routines | mkl_rci.fi | mkl_rci.h
Vector Mathematical Functions | mkl_vml.f77, mkl_vml.f90 | mkl_vml.h
Vector Statistical Functions | mkl_vsl.f77, mkl_vsl.f90 | mkl_vsl_functions.h
Fourier Transform Functions | mkl_dfti.f90 | mkl_dfti.h
Partial Differential Equations Support Routines
  Trigonometric Transforms | mkl_trig_transforms.f90 | mkl_trig_transforms.h
  Poisson Solvers | mkl_poisson.f90 | mkl_poisson.h
Data Fitting functions | mkl_df.f77, mkl_df.f90 | mkl_df.h
GMP interface † | - | mkl_gmp.h
Support functions | mkl_service.f90, mkl_service.fi | mkl_service.h
Memory allocation routines | - | i_malloc.h
Intel MKL examples interface | - | mkl_example.h

† GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also: Language Interfaces Support, by Function Domain

Appendix B: Support for Third-Party Interfaces

GMP* Functions

The Intel® Math Kernel Library (Intel® MKL) implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers. The interfaces of such functions fully match the GNU Multiple Precision* (GMP) Arithmetic Library. For specifications of these functions, please see http://software.intel.com/sites/products/documentation/hpc/mkl/gnump/index.htm.
NOTE: Intel MKL GMP Arithmetic Functions are deprecated and will be removed in a future release.

If you currently use the GMP* library, you need to modify INCLUDE statements in your programs to mkl_gmp.h.

FFTW Interface Support

Intel® Math Kernel Library (Intel® MKL) offers two collections of wrappers for the FFTW interface (www.fftw.org). The wrappers are the superstructure of FFTW to be used for calling the Intel MKL Fourier transform functions. These collections correspond to the FFTW versions 2.x and 3.x and the Intel MKL versions 7.0 and later.

These wrappers enable using Intel MKL Fourier transforms to improve the performance of programs that use FFTW without changing the program source code. See the "FFTW Interface to Intel® Math Kernel Library" appendix in the Intel MKL Reference Manual for details on the use of the wrappers.

Important: For ease of use, FFTW3 interface is also integrated in Intel MKL.

Appendix C: Directory Structure in Detail

Tables in this section show contents of the /lib directory.

Static Libraries in the lib directory

File | Contents
Interface layer
libmkl_intel.a | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support LP64 interface or on IA-32 architecture systems.
libmkl_intel_lp64.a | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support LP64 interface or on IA-32 architecture systems.
libmkl_intel_ilp64.a | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support ILP64 interface or on IA-32 architecture systems.
libmkl_intel_sp2dp.a | SP2DP interface library for the Intel compilers.
Threading layer
libmkl_intel_thread.a | Threading library for the Intel compilers
libmkl_pgi_thread.a | Threading library for the PGI* compiler
libmkl_sequential.a | Sequential library
Computational layer
libmkl_core.a | Kernel library
libmkl_solver_lp64.a | Deprecated. Empty library for backward compatibility
libmkl_solver_lp64_sequential.a | Deprecated. Empty library for backward compatibility
libmkl_solver_ilp64.a | Deprecated. Empty library for backward compatibility
libmkl_solver_ilp64_sequential.a | Deprecated. Empty library for backward compatibility

Dynamic Libraries in the lib directory

File | Contents
libmkl_rt.dylib | Single Dynamic Library
Interface layer
libmkl_intel.dylib | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support LP64 interface or on IA-32 architecture systems.
libmkl_intel_lp64.dylib | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support LP64 interface or on IA-32 architecture systems.
libmkl_intel_ilp64.dylib | Interface library for the Intel compilers. To be used on Intel® 64 architecture systems to support ILP64 interface or on IA-32 architecture systems.
libmkl_intel_sp2dp.dylib | SP2DP interface library for the Intel compilers.
Threading layer
libmkl_intel_thread.dylib | Threading library for the Intel compilers
libmkl_sequential.dylib | Sequential library
Computational layer
libmkl_core.dylib | Contains the dispatcher for dynamic load of the processor-specific kernel library
libmkl_lapack.dylib | LAPACK and DSS/PARDISO routines and drivers
libmkl_mc.dylib | 64-bit kernel for processors based on the Intel® Core™ microarchitecture
libmkl_mc3.dylib | 64-bit kernel for the Intel® Core™ i7 processors
libmkl_p4p.dylib | 32-bit kernel for the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3), including Intel® Core™ Duo and Intel® Core™ Solo processors.
libmkl_p4m.dylib | 32-bit kernel for the Intel® Core™ microarchitecture
libmkl_p4m3.dylib | 32-bit kernel library for the Intel® Core™ i7 processors
libmkl_vml_mc.dylib | 64-bit VML for processors based on the Intel® Core™ microarchitecture
libmkl_vml_mc2.dylib | 64-bit VML/VSL for 45nm Hi-k Intel® Core™2 and the Intel Xeon® processor families
libmkl_vml_mc3.dylib | 64-bit VML/VSL for the Intel® Core™ i7 processors
libmkl_vml_p4p.dylib | 32-bit VML for the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3)
libmkl_vml_p4m.dylib | 32-bit VML for processors based on Intel® Core™ microarchitecture
libmkl_vml_p4m2.dylib | 32-bit VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families
libmkl_vml_p4m3.dylib | 32-bit VML/VSL for the Intel® Core™ i7 processors
libmkl_vml_avx.dylib | VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX)
RTL
locale/en_US/mkl_msg.cat | Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English

Intel® Math Kernel Library for Windows* OS User's Guide

Intel® MKL - Windows* OS
Document Number: 315930-018US

Contents

Legal Information
Introducing the Intel® Math Kernel Library
Getting Help and Support
Notational Conventions

Chapter 1: Overview
  Document Overview
  What's New
  Related Information

Chapter 2: Getting Started
  Checking Your Installation
  Setting Environment Variables
  Compiler Support
  Using Code Examples
  What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Chapter 3: Structure of the Intel® Math Kernel Library
  Architecture Support
  High-level Directory Structure
  Layered Model Concept
  Contents of the Documentation Directories

Chapter 4: Linking Your Application with the Intel® Math Kernel Library
  Linking Quick Start
  Using the /Qmkl Compiler Option
  Automatically Linking a Project in the Visual Studio* Integrated Development Environment with Intel® MKL
  Automatically Linking Your Microsoft Visual C/C++* Project with Intel® MKL
  Automatically Linking Your Intel® Visual Fortran Project with Intel® MKL
  Using the Single Dynamic Library
  Selecting Libraries to Link with
  Using the Link-line Advisor
  Using the Command-line Link Tool
  Linking Examples
  Linking on IA-32 Architecture Systems
  Linking on Intel(R) 64 Architecture Systems
  Linking in Detail
  Dynamically Selecting the Interface and Threading Layer
  Linking with Interface Libraries
  Using the cdecl and stdcall Interfaces
  Using the ILP64 Interface vs. LP64 Interface
  Linking with Fortran 95 Interface Libraries
  Linking with Threading Libraries
  Sequential Mode of the Library
  Selecting the Threading Layer
  Linking with Computational Libraries
  Linking with Compiler Run-time Libraries
  Linking with System Libraries
  Building Custom Dynamic-link Libraries
  Using the Custom Dynamic-link Library Builder in the Command-line Mode
  Composing a List of Functions
  Specifying Function Names
  Building a Custom Dynamic-link Library in the Visual Studio* Development System
  Distributing Your Custom Dynamic-link Library

Chapter 5: Managing Performance and Memory
  Using Parallelism of the Intel® Math Kernel Library
  Threaded Functions and Problems
  Avoiding Conflicts in the Execution Environment
  Techniques to Set the Number of Threads
  Setting the Number of Threads Using an OpenMP* Environment Variable
  Changing the Number of Threads at Run Time
  Using Additional Threading Control
  Intel MKL-specific Environment Variables for Threading Control
  MKL_DYNAMIC
  MKL_DOMAIN_NUM_THREADS
  Setting the Environment Variables for Threading Control
  Tips and Techniques to Improve Performance
  Coding Techniques
  Hardware Configuration Tips
  Managing Multi-core Performance
  Operating on Denormals
  FFT Optimized Radices
  Using Memory Management
  Intel MKL Memory Management Software
  Redefining Memory Functions

Chapter 6: Language-specific Usage Options
  Using Language-Specific Interfaces with Intel® Math Kernel Library
  Interface Libraries and Modules
  Fortran 95 Interfaces to LAPACK and BLAS
  Compiler-dependent Functions and Fortran 90 Modules
  Using the stdcall Calling Convention in C/C++
  Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions
  Mixed-language Programming with the Intel Math Kernel Library
  Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments
  Using Complex Types in C/C++
  Calling BLAS Functions that Return the Complex Values in C/C++ Code
  Support for Boost uBLAS Matrix-matrix Multiplication
  Invoking Intel MKL Functions from Java* Applications
  Intel MKL Java* Examples
  Running the Java* Examples
  Known Limitations of the Java* Examples

Chapter 7: Coding Tips
  Aligning Data for Consistent Results
  Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Chapter 8: Working with the Intel® Math Kernel Library Cluster Software
  MPI Support
  Linking with ScaLAPACK and Cluster FFTs
  Determining the Number of Threads
  Using DLLs
  Setting Environment Variables on a Cluster
  Building ScaLAPACK Tests
  Examples for Linking with ScaLAPACK and Cluster FFT
  Examples for Linking a C Application
  Examples for Linking a Fortran Application

Chapter 9: Programming with Intel® Math Kernel Library in Integrated Development Environments (IDE)
  Configuring Your Integrated Development Environment to Link with Intel Math Kernel Library
  Configuring the Microsoft Visual C/C++* Development System to Link with Intel® MKL
  Configuring Intel® Visual Fortran to Link with Intel MKL
  Running an Intel MKL Example in the Visual Studio* 2008 IDE
  Creating, Configuring, and Running the Intel® C/C++ and/or Visual C++* 2008 Project
  Creating, Configuring, and Running the Intel Visual Fortran Project
  Support Files for Intel® Math Kernel Library Examples
  Known Limitations of the Project Creation Procedure
  Getting Assistance for Programming in the Microsoft Visual Studio* IDE
  Viewing Intel MKL Documentation in Visual Studio* IDE
  Using Context-Sensitive Help
  Using the IntelliSense* Capability

Chapter 10: LINPACK and MP LINPACK Benchmarks
  Intel® Optimized LINPACK Benchmark for Windows* OS
  Contents of the Intel® Optimized LINPACK Benchmark
  Running the Software
  Known Limitations of the Intel® Optimized LINPACK Benchmark
  Intel® Optimized MP LINPACK Benchmark for Clusters
  Overview of the Intel® Optimized MP LINPACK Benchmark for Clusters
  Contents of the Intel® Optimized MP LINPACK Benchmark for Clusters
  Building the MP LINPACK
  New Features of Intel® Optimized MP LINPACK Benchmark
  Benchmarking a Cluster
  Options to Reduce Search Time

Appendix A: Intel® Math Kernel Library Language Interfaces Support
  Language Interfaces Support, by Function Domain
  Include Files

Appendix B: Support for Third-Party Interfaces
  GMP* Functions
  FFTW Interface Support

Appendix C: Directory Structure in Detail
  Detailed Structure of the IA-32 Architecture Directories
  Static Libraries in the lib\ia32 Directory
  Dynamic Libraries in the lib\ia32 Directory
  Contents of the redist\ia32\mkl Directory
  Detailed Structure of the Intel® 64 Architecture Directories
  Static Libraries in the lib\intel64 Directory
  Dynamic Libraries in the lib\intel64 Directory
  Contents of the redist\intel64\mkl Directory

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java is a registered trademark of Oracle and/or its affiliates.

Copyright © 2007 - 2011, Intel Corporation. All rights reserved.

Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.
7Optimization Notice Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Intel® Math Kernel Library for Windows* OS User's Guide 8Introducing the Intel® Math Kernel Library The Intel ® Math Kernel Library (Intel ® MKL) improves performance of scientific, engineering, and financial software that solves large computational problems. Among other functionality, Intel MKL provides linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores (see the Intel ® MKL Release Notes for the full list of supported processors). Intel MKL also performs well on non-Intel processors. Intel MKL is thread-safe and extensively threaded using the OpenMP* technology. Intel MKL provides the following major functionality: • Linear algebra, implemented in LAPACK (solvers and eigensolvers) plus level 1, 2, and 3 BLAS, offering the vector, vector-matrix, and matrix-matrix operations needed for complex mathematical software. If you prefer the FORTRAN 90/95 programming language, you can call LAPACK driver and computational subroutines through specially designed interfaces with reduced numbers of arguments. A C interface to LAPACK is also available. 
• ScaLAPACK (SCAlable LAPACK) with its support functionality including the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS). ScaLAPACK is available for Intel MKL for Linux* and Windows* operating systems.
• A direct sparse solver, an iterative sparse solver, and a supporting set of sparse BLAS (level 1, 2, and 3) for solving sparse systems of equations.
• Multidimensional discrete Fourier transforms (1D, 2D, 3D) with mixed radix support (for sizes not limited to powers of 2). Distributed versions of these functions are provided for use on clusters on the Linux* and Windows* operating systems.
• A set of vectorized transcendental functions called the Vector Math Library (VML). For most of the supported processors, the Intel MKL VML functions offer greater performance than the libm (scalar) functions, while keeping the same high accuracy.
• The Vector Statistical Library (VSL), which offers high performance vectorized random number generators for several probability distributions, convolution and correlation routines, and summary statistics functions.
• Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.

For details see the Intel® MKL Reference Manual.

Getting Help and Support

Intel provides a support web site that contains a rich repository of self help information, including getting started tips, known product issues, product errata, license information, user forums, and more. Visit the Intel MKL support website at http://www.intel.com/software/products/support/.

The Intel MKL documentation integrates into the Microsoft Visual Studio* integrated development environment (IDE). See Getting Assistance for Programming in the Microsoft Visual Studio* IDE.

Notational Conventions

The following term is used in reference to the operating system.

Windows* OS
    This term refers to information that is valid on all supported Windows* operating systems.

The following notations are used to refer to Intel MKL directories.

    The installation directory for the Intel® C++ Composer XE or Intel® Visual Fortran Composer XE .

    The main directory where Intel MKL is installed: =\mkl. Replace this placeholder with the specific pathname in the configuring, linking, and building instructions.

The following font conventions are used in this document.

Italic
    Italic is used for emphasis and also indicates document names in body text, for example: see Intel MKL Reference Manual.

Monospace lowercase mixed with uppercase
    Indicates:
    • Commands and command-line options, for example,
      ifort myprog.f mkl_blas95.lib mkl_c.lib libiomp5md.lib
    • Filenames, directory names, and pathnames, for example,
      C:\Program Files\Java\jdk1.5.0_09
    • C/C++ code fragments, for example,
      a = new double [SIZE*SIZE];

UPPERCASE MONOSPACE
    Indicates system variables, for example, $MKLPATH.

Monospace italic
    Indicates a parameter in discussions, for example, lda. When enclosed in angle brackets, indicates a placeholder for an identifier, an expression, a string, a symbol, or a value, for example, . Substitute one of these items for the placeholder.

[ items ]
    Square brackets indicate that the items enclosed in brackets are optional.

{ item | item }
    Braces indicate that only one of the items listed between braces should be selected. A vertical bar ( | ) separates the items.

Chapter 1: Overview

Document Overview

The Intel® Math Kernel Library (Intel® MKL) User's Guide provides usage information for the library. The usage information covers the organization, configuration, performance, and accuracy of Intel MKL, specifics of routine calls in mixed-language programming, linking, and more.

This guide describes OS-specific usage of Intel MKL, along with OS-independent features. The document contains usage information for all Intel MKL function domains.

This User's Guide:

• Describes post-installation steps to help you start using the library
• Shows you how to configure the library with your development environment
• Acquaints you with the library structure
• Explains how to link your application with the library and provides simple usage scenarios
• Describes how to code, compile, and run your application with Intel MKL

This guide is intended for Windows OS programmers with beginner to advanced experience in software development.

See Also
Language Interfaces Support, by Function Domain

What's New

This User's Guide documents the Intel® Math Kernel Library (Intel® MKL) 10.3 Update 8.

The document was updated to reflect the addition of Data Fitting Functions to the product and to describe how to build a custom dynamic-link library in the Visual Studio* Development System (see Building a Custom Dynamic-link Library in the Visual Studio* Development System).
Related Information

For information on how to use the library in your application, use this guide in conjunction with the following documents:

• The Intel® Math Kernel Library Reference Manual, which provides reference information on routine functionalities, parameter descriptions, interfaces, calling syntaxes, and return values.
• The Intel® Math Kernel Library for Windows* OS Release Notes.

Chapter 2: Getting Started

Checking Your Installation

After installing the Intel® Math Kernel Library (Intel® MKL), verify that the library is properly installed and configured:

1. Intel MKL installs in . Check that the subdirectory of referred to as was created. Check that subdirectories for Intel MKL redistributable DLLs redist\ia32\mkl and redist\intel64\mkl were created in the directory (see redist.txt in the Intel MKL documentation directory for a list of files that can be redistributed).
2. If you want to keep multiple versions of Intel MKL installed on your system, update your build scripts to point to the correct Intel MKL version.
3. Check that the following files appear in the \bin directory and its subdirectories:
       mklvars.bat
       ia32\mklvars_ia32.bat
       intel64\mklvars_intel64.bat
   Use these files to assign Intel MKL-specific values to several environment variables, as explained in Setting Environment Variables.
4. To understand how the Intel MKL directories are structured, see Intel® Math Kernel Library Structure.
5. To make sure that Intel MKL runs on your system, do one of the following:
   • Launch an Intel MKL example, as explained in Using Code Examples
   • In the Visual Studio* IDE, create and run a simple project that uses Intel MKL, as explained in Running an Intel MKL Example in the Visual Studio IDE

See Also
Notational Conventions

Setting Environment Variables

When the installation of Intel MKL for Windows* OS is complete, set the PATH, LIB, and INCLUDE environment variables in the command shell using one of the script files in the bin subdirectory of the Intel MKL installation directory:

    ia32\mklvars_ia32.bat          for the IA-32 architecture,
    intel64\mklvars_intel64.bat    for the Intel® 64 architecture,
    mklvars.bat                    for the IA-32 and Intel® 64 architectures.

Running the Scripts

The scripts accept parameters to specify the following:

• Architecture.
• Addition of a path to Fortran 95 modules precompiled with the Intel® Fortran compiler to the INCLUDE environment variable. Supply this parameter only if you are using the Intel® Fortran compiler.
• Interface of the Fortran 95 modules. This parameter is needed only if you requested addition of a path to the modules.

Usage and values of these parameters depend on the script. The following table lists values of the script parameters.

Script             Architecture          Addition of a Path to    Interface
                   (required, when       Fortran 95 Modules       (optional)
                   applicable)           (optional)
mklvars_ia32       n/a †                 mod                      n/a
mklvars_intel64    n/a                   mod                      lp64 (default), ilp64
mklvars            ia32, intel64         mod                      lp64 (default), ilp64

† Not applicable.
For example:

• The command
      mklvars_ia32
  sets environment variables for the IA-32 architecture and adds no path to the Fortran 95 modules.
• The command
      mklvars_intel64 mod ilp64
  sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the ILP64 interface to the INCLUDE environment variable.
• The command
      mklvars intel64 mod
  sets environment variables for the Intel® 64 architecture and adds the path to the Fortran 95 modules for the LP64 interface to the INCLUDE environment variable.

NOTE Supply the parameter specifying the architecture first, if it is needed. Values of the other two parameters can be listed in any order.

See Also
High-level Directory Structure
Interface Libraries and Modules
Fortran 95 Interfaces to LAPACK and BLAS
Setting the Number of Threads Using an OpenMP* Environment Variable

Compiler Support

Intel MKL supports compilers identified in the Release Notes. However, the library has been successfully used with other compilers as well.

Although Compaq no longer supports the Compaq Visual Fortran* (CVF) compiler, Intel MKL still preserves the CVF interface in the IA-32 architecture implementation. You can use this interface with the Intel® Fortran Compiler. Intel MKL provides both stdcall (default CVF interface) and cdecl (default interface of the Microsoft Visual C* application) interfaces for the IA-32 architecture.

Intel MKL provides a set of include files to simplify program development by specifying enumerated values and prototypes for the respective functions. Calling Intel MKL functions from your application without an appropriate include file may lead to incorrect behavior of the functions.
See Also
Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions
Using the cdecl and stdcall Interfaces
Include Files

Using Code Examples

The Intel MKL package includes code examples, located in the examples subdirectory of the installation directory. Use the examples to determine:

• Whether Intel MKL is working on your system
• How you should call the library
• How to link the library

The examples are grouped in subdirectories mainly by Intel MKL function domains and programming languages. For example, the examples\spblas subdirectory contains a makefile to build the Sparse BLAS examples and the examples\vmlc subdirectory contains the makefile to build the C VML examples. Source code for the examples is in the next-level sources subdirectory.

See Also
High-level Directory Structure
Running an Intel MKL Example in the Visual Studio* 2008 IDE

What You Need to Know Before You Begin Using the Intel® Math Kernel Library

Target platform
    Identify the architecture of your target machine:
    • IA-32 or compatible
    • Intel® 64 or compatible
    Reason: Because Intel MKL libraries are located in directories corresponding to your particular architecture (see Architecture Support), you should provide proper paths on your link lines (see Linking Examples). To configure your development environment for use with Intel MKL, set your environment variables using the script corresponding to your architecture (see Setting Environment Variables for details).

Mathematical problem
    Identify all Intel MKL function domains that you require:
    • BLAS
    • Sparse BLAS
    • LAPACK
    • PBLAS
    • ScaLAPACK
    • Sparse Solver routines
    • Vector Mathematical Library functions (VML)
    • Vector Statistical Library functions
    • Fourier Transform functions (FFT)
    • Cluster FFT
    • Trigonometric Transform routines
    • Poisson, Laplace, and Helmholtz Solver routines
    • Optimization (Trust-Region) Solver routines
    • Data Fitting Functions
    • GMP* arithmetic functions. Deprecated and will be removed in a future release
    Reason: The function domain you intend to use narrows the search in the Reference Manual for specific routines you need. Additionally, if you are using the Intel MKL cluster software, your link line is function-domain specific (see Working with the Cluster Software). Coding tips may also depend on the function domain (see Tips and Techniques to Improve Performance).

Programming language
    Intel MKL provides support for both Fortran and C/C++ programming. Identify the language interfaces that your function domains support (see Intel® Math Kernel Library Language Interfaces Support).
    Reason: Intel MKL provides language-specific include files for each function domain to simplify program development (see Language Interfaces Support, by Function Domain). For a list of language-specific interface libraries and modules and an example of how to generate them, see also Using Language-Specific Interfaces with Intel® Math Kernel Library.

Range of integer data
    If your system is based on the Intel 64 architecture, identify whether your application performs calculations with large data arrays (of more than 2^31 - 1 elements).
    Reason: To operate on large data arrays, you need to select the ILP64 interface, where integers are 64-bit; otherwise, use the default, LP64, interface, where integers are 32-bit (see Using the ILP64 Interface vs. LP64 Interface).

Threading model
    Identify whether and how your application is threaded:
    • Threaded with the Intel compiler
    • Threaded with a third-party compiler
    • Not threaded
    Reason: The compiler you use to thread your application determines which threading library you should link with your application. For applications threaded with a third-party compiler you may need to use Intel MKL in the sequential mode (for more information, see Sequential Mode of the Library and Linking with Threading Libraries).

Number of threads
    Determine the number of threads you want Intel MKL to use.
    Reason: Intel MKL is based on OpenMP* threading. By default, the OpenMP* software sets the number of threads that Intel MKL uses. If you need a different number, you have to set it yourself using one of the available mechanisms. For more information, see Using Parallelism of the Intel® Math Kernel Library.

Linking model
    Decide which linking model is appropriate for linking your application with Intel MKL libraries:
    • Static
    • Dynamic
    Reason: The link libraries for static and dynamic linking are different. For the list of link libraries for static and dynamic models, linking examples, and other relevant topics, like how to save disk space by creating a custom dynamic library, see Linking Your Application with the Intel® Math Kernel Library.

MPI used
    Decide what MPI you will use with the Intel MKL cluster software. You are strongly encouraged to use Intel® MPI 3.2 or later.
    Reason: To link your application with ScaLAPACK and/or Cluster FFT, the libraries corresponding to your particular MPI should be listed on the link line (see Working with the Cluster Software).

Chapter 3: Structure of the Intel® Math Kernel Library

Architecture Support

Intel® Math Kernel Library (Intel® MKL) for Windows* OS provides two architecture-specific implementations. The following table lists the supported architectures and directories where each architecture-specific implementation is located.

Architecture                Location
IA-32 or compatible         \lib\ia32
                            \redist\ia32\mkl (DLLs)
Intel® 64 or compatible     \lib\intel64
                            \redist\intel64\mkl (DLLs)

See Also
High-level Directory Structure
Detailed Structure of the IA-32 Architecture Directories
Detailed Structure of the Intel® 64 Architecture Directories

High-level Directory Structure

Directory
    Contents

    Installation directory of the Intel® Math Kernel Library (Intel® MKL)

Subdirectories of

bin
    Batch files to set environment variables in the user shell
bin\ia32
    Batch files for the IA-32 architecture
bin\intel64
    Batch files for the Intel® 64 architecture
benchmarks\linpack
    Shared-Memory (SMP) version of the LINPACK benchmark
benchmarks\mp_linpack
    Message-passing interface (MPI) version of the LINPACK benchmark
lib\ia32
    Static libraries and static interfaces to DLLs for the IA-32 architecture
lib\intel64
    Static libraries and static interfaces to DLLs for the Intel® 64 architecture
examples
    Examples directory. Each subdirectory has source and data files
include
    INCLUDE files for the library routines, as well as for tests and examples
include\ia32
    Fortran 95 .mod files for the IA-32 architecture and Intel® Fortran compiler
include\intel64\lp64
    Fortran 95 .mod files for the Intel® 64 architecture, Intel® Fortran compiler, and LP64 interface
include\intel64\ilp64
    Fortran 95 .mod files for the Intel® 64 architecture, Intel® Fortran compiler, and ILP64 interface
include\fftw
    Header files for the FFTW2 and FFTW3 interfaces
interfaces\blas95
    Fortran 95 interfaces to BLAS and a makefile to build the library
interfaces\fftw2x_cdft
    MPI FFTW 2.x interfaces to Intel MKL Cluster FFTs
interfaces\fftw3x_cdft
    MPI FFTW 3.x interfaces to Intel MKL Cluster FFTs
interfaces\fftw2xc
    FFTW 2.x interfaces to the Intel MKL FFTs (C interface)
interfaces\fftw2xf
    FFTW 2.x interfaces to the Intel MKL FFTs (Fortran interface)
interfaces\fftw3xc
    FFTW 3.x interfaces to the Intel MKL FFTs (C interface)
interfaces\fftw3xf
    FFTW 3.x interfaces to the Intel MKL FFTs (Fortran interface)
interfaces\lapack95
    Fortran 95 interfaces to LAPACK and a makefile to build the library
tests
    Source and data files for tests
tools
    Command-line link tool and tools for creating custom dynamically linkable libraries
tools\builder
    Tools for creating custom dynamically linkable libraries

Subdirectories of

redist\ia32\mkl
    DLLs for applications running on processors with the IA-32 architecture
redist\intel64\mkl
    DLLs for applications running on processors with Intel® 64 architecture
Documentation\en_US\MKL
    Intel MKL documentation
Documentation\vshelp\1033\intel.mkldocs
    Help2-format files for integration of the Intel MKL documentation with the Microsoft Visual Studio* 2005/2008 IDE
Documentation\msvhelp\1033\mkl
    Microsoft Help Viewer*-format files for integration of the Intel MKL documentation with the Microsoft Visual Studio* 2010 IDE

See Also
Notational Conventions
Layered Model Concept

Intel MKL is structured to support multiple compilers and interfaces, different OpenMP* implementations, both serial and multiple threads, and a wide range of processors. Conceptually Intel MKL can be divided into distinct parts to support different interfaces, threading models, and core computations:

1. Interface Layer
2. Threading Layer
3. Computational Layer

You can combine Intel MKL libraries to meet your needs by linking with one library in each part, layer by layer. Once the interface library is selected, the threading library you select picks up the chosen interface, and the computational library uses interfaces and OpenMP implementation (or non-threaded mode) chosen in the first two layers.

To support threading with different compilers, one more layer is needed, which contains libraries not included in Intel MKL:

• Compiler run-time libraries (RTL).

The following table provides more details of each layer.

Layer
    Description

Interface Layer
    This layer matches compiled code of your application with the threading and/or computational parts of the library. This layer provides:
    • cdecl and CVF default interfaces.
    • LP64 and ILP64 interfaces.
    • Compatibility with compilers that return function values differently.
    • A mapping between single-precision names and double-precision names for applications using Cray*-style naming (SP2DP interface).
      SP2DP interface supports Cray-style naming in applications targeted for the Intel 64 architecture and using the ILP64 interface. SP2DP interface provides a mapping between single-precision names (for both real and complex types) in the application and double-precision names in Intel MKL BLAS and LAPACK. Function names are mapped as shown in the following example for BLAS functions ?GEMM:
          SGEMM -> DGEMM
          DGEMM -> DGEMM
          CGEMM -> ZGEMM
          ZGEMM -> ZGEMM
      Note that no changes are made to double-precision names.
Threading Layer
    This layer:
    • Provides a way to link threaded Intel MKL with different threading compilers.
    • Enables you to link with a threaded or sequential mode of the library.
    This layer is compiled for different environments (threaded or sequential) and compilers (from Intel, Microsoft, and so on).

Computational Layer
    This layer is the heart of Intel MKL. It has only one library for each combination of architecture and supported OS. The Computational layer accommodates multiple architectures through identification of architecture features and chooses the appropriate binary code at run time.

Compiler Run-time Libraries (RTL)
    To support threading with Intel compilers, Intel MKL uses RTLs of the Intel® C++ Composer XE or Intel® Visual Fortran Composer XE. To thread using third-party threading compilers, use libraries in the Threading layer or an appropriate compatibility library.

See Also
Using the ILP64 Interface vs. LP64 Interface
Linking Your Application with the Intel® Math Kernel Library
Linking with Threading Libraries

Contents of the Documentation Directories

Most of Intel MKL documentation is installed at \Documentation\ \mkl. For example, the documentation in English is installed at \Documentation\en_US\mkl. However, some Intel MKL-related documents are installed one or two levels up. The following table lists MKL-related documentation.
File name
    Comment

Files in \Documentation

\clicense.rtf or \flicense.rtf
    Common end user license for the Intel® C++ Composer XE 2011 or Intel® Visual Fortran Composer XE 2011, respectively
mklsupport.txt
    Information on package number for customer support reference

Contents of \Documentation\\mkl

redist.txt
    List of redistributable files
mkl_documentation.htm
    Overview and links for the Intel MKL documentation
mkl_manual\index.htm
    Intel MKL Reference Manual in an uncompressed HTML format
Release_Notes.htm
    Intel MKL Release Notes
mkl_userguide\index.htm
    Intel MKL User's Guide in an uncompressed HTML format (this document)
mkl_link_line_advisor.htm
    Intel MKL Link-line Advisor

Chapter 4: Linking Your Application with the Intel® Math Kernel Library

Linking Quick Start

Intel® Math Kernel Library (Intel® MKL) provides several options for quick linking of your application. The simplest options depend on your development environment:

Intel® Composer XE compiler
    see Using the /Qmkl Compiler Option.
Microsoft Visual Studio* Integrated Development Environment (IDE)
    see Automatically Linking a Project in the Visual Studio* IDE with Intel MKL.
Other options are independent of your development environment, but depend on the way you link:

Explicit dynamic linking
    see Using the Single Dynamic Library for how to simplify your link line.
Explicitly listing libraries on your link line
    see Selecting Libraries to Link with for a summary of the libraries.
Using an interactive interface
    see Using the Link-line Advisor to determine libraries and options to specify on your link or compilation line.
Using an internally provided tool
    see Using the Command-line Link Tool to determine libraries, options, and environment variables or even compile and build your application.

Using the /Qmkl Compiler Option

The Intel® Composer XE compiler supports the following variants of the /Qmkl compiler option:

/Qmkl or /Qmkl:parallel
    to link with standard threaded Intel MKL.
/Qmkl:sequential
    to link with sequential version of Intel MKL.
/Qmkl:cluster
    to link with Intel MKL cluster components (sequential) that use Intel MPI.

For more information on the /Qmkl compiler option, see the Intel Compiler User and Reference Guides.

For each variant of the /Qmkl option, the compiler links your application using the following conventions:

• cdecl for the IA-32 architecture
• LP64 for the Intel® 64 architecture

If you specify any variant of the /Qmkl compiler option, the compiler automatically includes the Intel MKL libraries. In cases not covered by the option, use the Link-line Advisor or see Linking in Detail.

See Also
Using the ILP64 Interface vs. LP64 Interface
Using the Link-line Advisor
Intel® Software Documentation Library

Automatically Linking a Project in the Visual Studio* Integrated Development Environment with Intel® MKL

After a default installation of the Intel® Math Kernel Library (Intel® MKL), Intel® C++ Composer XE, or Intel® Visual Fortran Composer XE, you can easily configure your project to automatically link with Intel MKL.
Automatically Linking Your Microsoft Visual C/C++* Project with Intel® MKL

Configure your Microsoft Visual C/C++* project for automatic linking with Intel MKL as follows:

• For the Visual Studio* 2010 development system:
  1. Go to Project>Properties>Configuration Properties>Intel Performance Libraries.
  2. Change the Use MKL property setting by selecting Parallel, Sequential, or Cluster as appropriate.
• For the Visual Studio 2005/2008 development system:
  1. Go to Project>Intel C++ Composer XE 2011>Select Build Components.
  2. From the Use MKL drop-down menu, select Parallel, Sequential, or Cluster as appropriate.

Specific Intel MKL libraries that link with your application may depend on more project settings. For details, see the Intel® Composer XE documentation.

See Also
Intel® Software Documentation Library

Automatically Linking Your Intel® Visual Fortran Project with Intel® MKL

Configure your Intel® Visual Fortran project for automatic linking with Intel MKL as follows:

Go to Project > Properties > Libraries > Use Intel Math Kernel Library and select Parallel, Sequential, or Cluster as appropriate.

Specific Intel MKL libraries that link with your application may depend on more project settings. For details see the Intel® Visual Fortran Compiler XE User and Reference Guides.

See Also
Intel® Software Documentation Library

Using the Single Dynamic Library

You can simplify your link line through the use of the Intel MKL Single Dynamic Library (SDL).

To use SDL, place mkl_rt.lib on your link line. For example:

    icl.exe application.c mkl_rt.lib

mkl_rt.lib is the import library for mkl_rt.dll.

SDL enables you to select the interface and threading library for Intel MKL at run time.
By default, linking with SDL provides:

• LP64 interface on systems based on the Intel® 64 architecture
• Intel threading

To use other interfaces or change threading preferences, including use of the sequential version of Intel MKL, you need to specify your choices using functions or environment variables as explained in section Dynamically Selecting the Interface and Threading Layer.

Selecting Libraries to Link with

To link with Intel MKL:

• Choose one library from the Interface layer and one library from the Threading layer
• Add the only library from the Computational layer and run-time libraries (RTL)

The following table lists Intel MKL libraries to link with your application.

                                           Interface layer           Threading layer             Computational layer    RTL
IA-32 architecture, static linking         mkl_intel_c.lib           mkl_intel_thread.lib        mkl_core.lib           libiomp5md.lib
IA-32 architecture, dynamic linking        mkl_intel_c_dll.lib       mkl_intel_thread_dll.lib    mkl_core_dll.lib       libiomp5md.lib
Intel® 64 architecture, static linking     mkl_intel_lp64.lib        mkl_intel_thread.lib        mkl_core.lib           libiomp5md.lib
Intel® 64 architecture, dynamic linking    mkl_intel_lp64_dll.lib    mkl_intel_thread_dll.lib    mkl_core_dll.lib       libiomp5md.lib

The Single Dynamic Library (SDL) automatically links interface, threading, and computational libraries and thus simplifies linking. The following table lists Intel MKL libraries for dynamic linking using SDL. See Dynamically Selecting the Interface and Threading Layer for how to set the interface and threading layers at run time through function calls or environment settings.

                                      SDL           RTL
IA-32 and Intel® 64 architectures     mkl_rt.lib    libiomp5md.lib †

† Linking with libiomp5md.lib is not required.

For exceptions and alternatives to the libraries listed above, see Linking in Detail.
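The four rows of the library-selection table above follow a regular pattern, which can be expressed as a small lookup. The following is a minimal sketch in C and is not part of Intel MKL: the function name compose_mkl_libs and the enum names are hypothetical, chosen only for illustration. Only the library file names are taken from the table, and the sketch covers only the threaded case with the default cdecl (IA-32) and LP64 (Intel® 64) interfaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for illustration; these are not Intel MKL identifiers. */
typedef enum { TARGET_IA32, TARGET_INTEL64 } target_arch;
typedef enum { LINK_STATIC, LINK_DYNAMIC } link_model;

/*
 * Writes the interface, threading, computational, and RTL libraries
 * for one row of the table into out, separated by spaces.
 * Returns 0 on success, or -1 if out is too small.
 */
int compose_mkl_libs(target_arch arch, link_model model,
                     char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);

    /* Interface layer: cdecl library on IA-32, LP64 library on Intel 64. */
    const char *iface = (arch == TARGET_IA32) ? "mkl_intel_c"
                                              : "mkl_intel_lp64";
    /* Dynamic linking uses the _dll import libraries for all three layers. */
    const char *suffix = (model == LINK_DYNAMIC) ? "_dll" : "";

    int n = snprintf(out, out_size,
                     "%s%s.lib mkl_intel_thread%s.lib mkl_core%s.lib "
                     "libiomp5md.lib",
                     iface, suffix, suffix, suffix);

    /* snprintf reports the length it needed; treat truncation as failure. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For instance, a call with TARGET_INTEL64 and LINK_DYNAMIC fills the buffer with the libraries from the last row of the table, which can then be appended to an ifort or icl command. Sequential mode, the stdcall and ILP64 interfaces, and SDL are deliberately left out of this sketch.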
See Also
Layered Model Concept
Using the Link-line Advisor
Using the /Qmkl Compiler Option
Working with the Intel® Math Kernel Library Cluster Software

Using the Link-line Advisor

Use the Intel MKL Link-line Advisor to determine the libraries and options to specify on your link or compilation line.

The latest version of the tool is available at http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor. The tool is also available in the product.

The Advisor requests information about your system and on how you intend to use Intel MKL (link dynamically or statically, use threaded or sequential mode, etc.). The tool automatically generates the appropriate link line for your application.

See Also
Contents of the Documentation Directories

Using the Command-line Link Tool

Use the command-line Link tool provided by Intel MKL to simplify building your application with Intel MKL.

The tool not only provides the options, libraries, and environment variables to use, but also performs compilation and building of your application.

The tool mkl_link_tool.exe is installed in the \tools directory.

See the knowledge base article at http://software.intel.com/en-us/articles/mkl-command-line-link-tool for more information.

Linking Examples

See Also
Using the Link-line Advisor
Examples for Linking with ScaLAPACK and Cluster FFT

Linking on IA-32 Architecture Systems

The following examples illustrate linking that uses Intel® compilers.

The examples use the .f Fortran source file.
C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icl:

• Static linking of myprog.f and parallel Intel MKL supporting the cdecl interface:
  ifort myprog.f mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Dynamic linking of myprog.f and parallel Intel MKL supporting the cdecl interface:
  ifort myprog.f mkl_intel_c_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
• Static linking of myprog.f and sequential version of Intel MKL supporting the cdecl interface:
  ifort myprog.f mkl_intel_c.lib mkl_sequential.lib mkl_core.lib
• Dynamic linking of myprog.f and sequential version of Intel MKL supporting the cdecl interface:
  ifort myprog.f mkl_intel_c_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib
• Static linking of user code myprog.f and parallel Intel MKL supporting the stdcall interface:
  ifort myprog.f mkl_intel_s.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Dynamic linking of user code myprog.f and parallel Intel MKL supporting the stdcall interface:
  ifort myprog.f mkl_intel_s_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL supporting the cdecl or stdcall interface (call the mkl_set_threading_layer function or set the value of the MKL_THREADING_LAYER environment variable to choose threaded or sequential mode):
  ifort myprog.f mkl_rt.lib
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL supporting the cdecl interface:
  ifort myprog.f mkl_lapack95.lib mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL supporting the cdecl interface:
  ifort myprog.f mkl_blas95.lib mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Examples for Linking a C Application
Examples for Linking a Fortran Application
Using the Single Dynamic Library

Linking on Intel(R) 64 Architecture Systems

The following examples illustrate linking that uses Intel(R) compilers.

The examples use the .f Fortran source file. C/C++ users should instead specify a .cpp (C++) or .c (C) file and replace ifort with icl:

• Static linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
  ifort myprog.f mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Dynamic linking of myprog.f and parallel Intel MKL supporting the LP64 interface:
  ifort myprog.f mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
• Static linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
  ifort myprog.f mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib
• Dynamic linking of myprog.f and sequential version of Intel MKL supporting the LP64 interface:
  ifort myprog.f mkl_intel_lp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib
• Static linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
  ifort myprog.f mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Dynamic linking of myprog.f and parallel Intel MKL supporting the ILP64 interface:
  ifort myprog.f mkl_intel_ilp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
• Dynamic linking of user code myprog.f and parallel or sequential Intel MKL supporting the LP64 or ILP64 interface (call appropriate functions or set environment variables to choose threaded or sequential mode and to set the interface):
  ifort myprog.f mkl_rt.lib
• Static linking of myprog.f, Fortran 95 LAPACK interface, and parallel Intel MKL supporting the LP64 interface:
  ifort myprog.f mkl_lapack95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
• Static linking of myprog.f, Fortran 95 BLAS interface, and parallel Intel MKL supporting the LP64 interface:
  ifort myprog.f mkl_blas95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Examples for Linking a C Application
Examples for Linking a Fortran Application
Using the Single Dynamic Library

Linking in Detail

This section recommends which libraries to link with depending on your Intel MKL usage scenario and provides details of the linking.

Dynamically Selecting the Interface and Threading Layer

The Single Dynamic Library (SDL) enables you to dynamically select the interface and threading layer for Intel MKL.

Setting the Interface Layer

Available interfaces depend on the architecture of your system.

On systems based on the Intel® 64 architecture, LP64 and ILP64 interfaces are available. To set one of these interfaces at run time, use the mkl_set_interface_layer function or the MKL_INTERFACE_LAYER environment variable. The following table provides values to be used to set each interface.

Interface Layer   Value of MKL_INTERFACE_LAYER   Value of the Parameter of mkl_set_interface_layer
LP64              LP64                           MKL_INTERFACE_LP64
ILP64             ILP64                          MKL_INTERFACE_ILP64

If the mkl_set_interface_layer function is called, the environment variable MKL_INTERFACE_LAYER is ignored.

By default the LP64 interface is used.

See the Intel MKL Reference Manual for details of the mkl_set_interface_layer function.

On systems based on the IA-32 architecture, the cdecl and stdcall interfaces are available. These interfaces have different function naming conventions, and SDL selects between cdecl and stdcall at link time according to the function names.

Setting the Threading Layer

To set the threading layer at run time, use the mkl_set_threading_layer function or the MKL_THREADING_LAYER environment variable. The following table lists available threading layers along with the values to be used to set each layer.
Threading Layer                Value of MKL_THREADING_LAYER   Value of the Parameter of mkl_set_threading_layer
Intel threading                INTEL                          MKL_THREADING_INTEL
Sequential mode of Intel MKL   SEQUENTIAL                     MKL_THREADING_SEQUENTIAL
PGI threading                  PGI                            MKL_THREADING_PGI

If the mkl_set_threading_layer function is called, the environment variable MKL_THREADING_LAYER is ignored.

By default Intel threading is used.

See the Intel MKL Reference Manual for details of the mkl_set_threading_layer function.

Replacing Error Handling and Progress Information Routines

You can replace the Intel MKL error handling routine xerbla or progress information routine mkl_progress with your own function. If you are using SDL, to replace xerbla or mkl_progress, call the mkl_set_xerbla or mkl_set_progress function, respectively. See the Intel MKL Reference Manual for details.

NOTE If you are using SDL, you cannot perform the replacement by linking the object file with your implementation of xerbla or mkl_progress.

See Also
Using the Single Dynamic Library
Layered Model Concept
Using the cdecl and stdcall Interfaces
Directory Structure in Detail

Linking with Interface Libraries

Using the cdecl and stdcall Interfaces

Intel MKL provides the following interfaces in its IA-32 architecture implementation:
• stdcall
  Default Compaq Visual Fortran* (CVF) interface. Use it with the Intel® Fortran Compiler.
• cdecl
  Default interface of the Microsoft Visual C/C++* application.

To use each of these interfaces, link with the appropriate library, as specified in the following table:

Interface   Library for Static Linking   Library for Dynamic Linking
cdecl       mkl_intel_c.lib              mkl_intel_c_dll.lib
stdcall     mkl_intel_s.lib              mkl_intel_s_dll.lib

To link with the cdecl or stdcall interface library, use appropriate calling syntax in C applications and appropriate compiler options for Fortran applications.
If you are using a C compiler, to link with the cdecl or stdcall interface library, call Intel MKL routines in your code as explained in the table below:

Interface Library        Calling Intel MKL Routines
mkl_intel_s[_dll].lib    Call a routine with the following statement:
                         extern __stdcall name( , , .. );
                         where stdcall is actually the CVF compiler default compilation, which differs from the regular stdcall compilation in how strings are passed to the routine. Because the default CVF format is not identical to stdcall, you must handle strings specially in the calling sequence. For how to do this, see the sections on interfaces in the CVF documentation.
mkl_intel_c[_dll].lib    Use the following declaration:
                         name( , , .. );

If you are using a Fortran compiler, to link with the cdecl or stdcall interface library, provide compiler options as explained in the table below:

Compiler                  Interface Library        Compiler Options                      Comment
CVF compiler              mkl_intel_s[_dll].lib    Default
                          mkl_intel_c[_dll].lib    /iface=(cref, nomixed_str_len_arg)
Intel® Fortran compiler   mkl_intel_c[_dll].lib    Default
                          mkl_intel_s[_dll].lib    /Gm or /iface:cvf                     /Gm and /iface:cvf options enable compatibility of the CVF and Powerstation calling conventions

See Also
Using the stdcall Calling Convention in C/C++
Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions

Using the ILP64 Interface vs. LP64 Interface

The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 2^31-1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.

The LP64 and ILP64 interfaces are implemented in the Interface layer.
Link with the following interface libraries for the LP64 or ILP64 interface, respectively:
• mkl_intel_lp64.lib or mkl_intel_ilp64.lib for static linking
• mkl_intel_lp64_dll.lib or mkl_intel_ilp64_dll.lib for dynamic linking

The ILP64 interface provides for the following:
• Support for large data arrays (with more than 2^31-1 elements)
• Compiling your Fortran code with the /4I8 compiler option

The LP64 interface provides compatibility with the previous Intel MKL versions because "LP64" is just a new name for the only interface that the Intel MKL versions lower than 9.1 provided. Choose the ILP64 interface if your application uses Intel MKL for calculations with large data arrays, or may do so in the future.

Intel MKL provides the same include directory for the ILP64 and LP64 interfaces.

Compiling for LP64/ILP64

The table below shows how to compile for the ILP64 and LP64 interfaces:

Fortran
  Compiling for ILP64   ifort /4I8 /I\include ...
  Compiling for LP64    ifort /I\include ...
C or C++
  Compiling for ILP64   icl /DMKL_ILP64 /I\include ...
  Compiling for LP64    icl /I\include ...

CAUTION Linking an application compiled with the /4I8 or /DMKL_ILP64 option to the LP64 libraries may result in unpredictable consequences and erroneous output.

Coding for ILP64

You do not need to change existing code if you are not using the ILP64 interface.
To migrate to ILP64 or write new code for ILP64, use appropriate types for parameters of the Intel MKL functions and subroutines:

Integer Types                            Fortran                           C or C++
32-bit integers                          INTEGER*4 or INTEGER(KIND=4)      int
Universal integers for ILP64/LP64:       INTEGER without specifying KIND   MKL_INT
  • 64-bit for ILP64
  • 32-bit otherwise
Universal integers for ILP64/LP64:       INTEGER*8 or INTEGER(KIND=8)      MKL_INT64
  • 64-bit integers
FFT interface integers for ILP64/LP64    INTEGER without specifying KIND   MKL_LONG

To determine the type of an integer parameter of a function, use appropriate include files. For functions that support only a Fortran interface, use the C/C++ include files *.h.

The above table explains which integer parameters of functions become 64-bit and which remain 32-bit for ILP64. The table applies to most Intel MKL functions except some VML and VSL functions, which require integer parameters to be 64-bit or 32-bit regardless of the interface:
• VML: The mode parameter of VML functions is 64-bit.
• Random Number Generators (RNG): All discrete RNG except viRngUniformBits64 are 32-bit. The viRngUniformBits64 generator function and vslSkipAheadStream service function are 64-bit.
• Summary Statistics: The estimate parameter of the vslsSSCompute/vsldSSCompute function is 64-bit.

Refer to the Intel MKL Reference Manual for more information.

To better understand ILP64 interface details, see also examples and tests.

Limitations

All Intel MKL function domains support ILP64 programming with the following exceptions:
• FFTW interfaces to Intel MKL:
  • FFTW 2.x wrappers do not support ILP64.
  • FFTW 3.2 wrappers support ILP64 by a dedicated set of functions plan_guru64.
• GMP* Arithmetic Functions do not support ILP64.

NOTE GMP Arithmetic Functions are deprecated and will be removed in a future release.
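To make the 2^31-1 threshold mentioned for ILP64 concrete, the short Python sketch below counts the elements of an array from its dimensions and compares the count with the largest value a 32-bit signed integer can hold. The function name exceeds_lp64_indexing is hypothetical and only illustrates the arithmetic. Whether a given application actually needs ILP64 depends on which integer parameters it passes to Intel MKL, and this sketch does not check that.

```python
# Sketch: the largest value a 32-bit signed integer (the LP64 index type) can hold.
INT32_MAX = 2**31 - 1  # 2147483647


def exceeds_lp64_indexing(*dims):
    """Return True if an array with these dimensions has more than 2^31-1 elements."""
    count = 1
    for d in dims:
        count *= d
    return count > INT32_MAX


# A 46340 x 46340 matrix has 2147395600 elements, just under the limit.
print(exceeds_lp64_indexing(46340, 46340))
# A 46341 x 46341 matrix has 2147488281 elements, just over the limit.
print(exceeds_lp64_indexing(46341, 46341))
```

For square matrices the crossover therefore falls between orders 46340 and 46341, which gives a quick way to judge whether a problem size is anywhere near the limit.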
See Also
High-level Directory Structure
Include Files
Language Interfaces Support, by Function Domain
Layered Model Concept
Directory Structure in Detail

Linking with Fortran 95 Interface Libraries

The mkl_blas95*.lib and mkl_lapack95*.lib libraries contain Fortran 95 interfaces for BLAS and LAPACK, respectively, which are compiler-dependent. In the Intel MKL package, they are prebuilt for the Intel® Fortran compiler. If you are using a different compiler, build these libraries before using the interface.

See Also
Fortran 95 Interfaces to LAPACK and BLAS
Compiler-dependent Functions and Fortran 90 Modules

Linking with Threading Libraries

Sequential Mode of the Library

You can use Intel MKL in a sequential (non-threaded) mode. In this mode, Intel MKL runs unthreaded code. However, it is thread-safe (except the LAPACK deprecated routine ?lacon), which means that you can use it in a parallel region in your OpenMP* code. The sequential mode requires no compatibility OpenMP* run-time library and does not respond to the environment variable OMP_NUM_THREADS or its Intel MKL equivalents.

You should use the library in the sequential mode only if you have a particular reason not to use Intel MKL threading. The sequential mode may be helpful when using Intel MKL with programs threaded with some non-Intel compilers or in other situations where you need a non-threaded version of the library (for instance, in some MPI cases). To set the sequential mode, in the Threading layer, choose the *sequential.* library.

See Also
Directory Structure in Detail
Using Parallelism of the Intel® Math Kernel Library
Avoiding Conflicts in the Execution Environment
Linking Examples

Selecting the Threading Layer

Several compilers that Intel MKL supports use the OpenMP* threading technology. Intel MKL supports implementations of the OpenMP* technology that these compilers provide.
To make use of this support, you need to link with the appropriate library in the Threading Layer and Compiler Support Run-time Library (RTL).

Threading Layer

Each Intel MKL threading library contains the same code compiled by the respective compiler (Intel and PGI* compilers on Windows OS).

RTL

This layer includes libiomp, the compatibility OpenMP* run-time library of the Intel compiler. In addition to the Intel compiler, libiomp provides support for one more threading compiler on Windows OS (Microsoft Visual C++*). That is, a program threaded with the Microsoft Visual C++ compiler can safely be linked with Intel MKL and libiomp.

The table below helps explain what threading library and RTL you should choose under different scenarios when using Intel MKL (static cases only):

Compiler    Application Threaded?   Threading Layer                            RTL Recommended   Comment
Intel       Does not matter         mkl_intel_thread.lib                       libiomp5md.lib
PGI         Yes                     mkl_pgi_thread.lib or mkl_sequential.lib   PGI* supplied     Use of mkl_sequential.lib removes threading from Intel MKL calls.
PGI         No                      mkl_intel_thread.lib                       libiomp5md.lib
PGI         No                      mkl_pgi_thread.lib                         PGI* supplied
PGI         No                      mkl_sequential.lib                         None
Microsoft   Yes                     mkl_intel_thread.lib                       libiomp5md.lib    For the OpenMP* library of the Microsoft Visual Studio* IDE version 2005 or later.
Microsoft   Yes                     mkl_sequential.lib                         None              For Win32 threading.
Microsoft   No                      mkl_intel_thread.lib                       libiomp5md.lib
other       Yes                     mkl_sequential.lib                         None
other       No                      mkl_intel_thread.lib                       libiomp5md.lib

TIP To use the threaded Intel MKL, compile your code with the /MT option. The compiler driver will pass the option to the linker and the latter will load multi-thread (MT) run-time libraries.
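The compiler and threading table above can be read as a lookup keyed on the compiler and on whether the application itself is threaded. The Python sketch below transcribes those rows so the valid combinations can be queried programmatically. THREADING_TABLE and threading_options are hypothetical names chosen for this illustration, the "PGI supplied" string stands for the run-time library that ships with the PGI* compiler, and None means that no RTL is needed.

```python
# Rows transcribed from the table above (static linking cases only).
# Key: (compiler, application threaded?).
# Value: list of (threading layer library, recommended RTL).
# THREADING_TABLE and threading_options are hypothetical names for illustration.
IOMP = "libiomp5md.lib"
THREADING_TABLE = {
    ("pgi", True): [("mkl_pgi_thread.lib", "PGI supplied"),
                    ("mkl_sequential.lib", "PGI supplied")],
    ("pgi", False): [("mkl_intel_thread.lib", IOMP),
                     ("mkl_pgi_thread.lib", "PGI supplied"),
                     ("mkl_sequential.lib", None)],
    ("microsoft", True): [("mkl_intel_thread.lib", IOMP),   # OpenMP* library of VS 2005 or later
                          ("mkl_sequential.lib", None)],     # Win32 threading
    ("microsoft", False): [("mkl_intel_thread.lib", IOMP)],
    ("other", True): [("mkl_sequential.lib", None)],
    ("other", False): [("mkl_intel_thread.lib", IOMP)],
}


def threading_options(compiler, app_threaded):
    """Return the (threading library, RTL) choices listed for this scenario."""
    compiler = compiler.lower()
    if compiler == "intel":
        # For the Intel compiler the table says threading "does not matter".
        return [("mkl_intel_thread.lib", IOMP)]
    if compiler not in ("pgi", "microsoft"):
        compiler = "other"
    return THREADING_TABLE[(compiler, bool(app_threaded))]


for lib, rtl in threading_options("Microsoft", True):
    print(lib, rtl)
```

Returning a list instead of a single pair keeps the cases where the table offers alternatives, such as a PGI-compiled application that is not threaded.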
Linking with Computational Libraries

If you are not using the Intel MKL cluster software, you need to link your application with only one computational library, depending on the linking method:

Static Linking   Dynamic Linking
mkl_core.lib     mkl_core_dll.lib

Computational Libraries for Applications that Use the Intel MKL Cluster Software

ScaLAPACK and Cluster Fourier Transform Functions (Cluster FFT) require more computational libraries, which may depend on your architecture.

The following table lists computational libraries for IA-32 architecture applications that use ScaLAPACK or Cluster FFT.

Computational Libraries for IA-32 Architecture

Function domain                         Static Linking                        Dynamic Linking
ScaLAPACK †                             mkl_scalapack_core.lib mkl_core.lib   mkl_scalapack_core_dll.lib mkl_core_dll.lib
Cluster Fourier Transform Functions †   mkl_cdft_core.lib mkl_core.lib        mkl_cdft_core_dll.lib mkl_core_dll.lib

† Also add the library with BLACS routines corresponding to the MPI used.

The following table lists computational libraries for Intel® 64 architecture applications that use ScaLAPACK or Cluster FFT.

Computational Libraries for the Intel® 64 Architecture

Function domain                         Static Linking                         Dynamic Linking
ScaLAPACK, LP64 interface †             mkl_scalapack_lp64.lib mkl_core.lib    mkl_scalapack_lp64_dll.lib mkl_core_dll.lib
ScaLAPACK, ILP64 interface †            mkl_scalapack_ilp64.lib mkl_core.lib   mkl_scalapack_ilp64_dll.lib mkl_core_dll.lib
Cluster Fourier Transform Functions †   mkl_cdft_core.lib mkl_core.lib         mkl_cdft_core_dll.lib mkl_core_dll.lib

† Also add the library with BLACS routines corresponding to the MPI used.

See Also
Linking with ScaLAPACK and Cluster FFTs
Using the Link-line Advisor
Using the ILP64 Interface vs. LP64 Interface

Linking with Compiler Run-time Libraries

Dynamically link libiomp, the compatibility OpenMP* run-time library, even if you link other libraries statically.
Linking to libiomp statically can be problematic because the more complex your operating environment or application, the more likely it is that redundant copies of the library are included. This may result in performance issues (oversubscription of threads) and even incorrect results.

To link libiomp dynamically, be sure the PATH environment variable is defined correctly.

See Also
Setting Environment Variables
Layered Model Concept

Linking with System Libraries

If your system is based on the Intel® 64 architecture, be aware that Microsoft SDK builds 1289 or higher provide the bufferoverflowu.lib library to resolve the __security_cookie external references. Makefiles for examples and tests include this library by using the buf_lib=bufferoverflowu.lib macro. If you are using older SDKs, leave this macro empty on your command line as follows: buf_lib= .

Building Custom Dynamic-link Libraries

Custom dynamic-link libraries (DLL) reduce the collection of functions available in Intel MKL libraries to those required to solve your particular problems, which helps to save disk space and build your own dynamic libraries for distribution.

The Intel MKL custom DLL builder, located in the tools\builder directory, enables you to create a dynamic library containing the selected functions. The builder contains a makefile and a definition file with the list of functions.

Using the Custom Dynamic-link Library Builder in the Command-line Mode

To build a custom DLL, use the following command:

nmake target []

The following table lists possible values of target and explains what the command does for each value:

Value        Comment
libia32      The builder uses static Intel MKL interface, threading, and core libraries to build a custom DLL for the IA-32 architecture.
libintel64   The builder uses static Intel MKL interface, threading, and core libraries to build a custom DLL for the Intel® 64 architecture.
dllia32      The builder uses the single dynamic library libmkl_rt.dll to build a custom DLL for the IA-32 architecture.
dllintel64   The builder uses the single dynamic library libmkl_rt.dll to build a custom DLL for the Intel® 64 architecture.
help         The command prints Help on the custom DLL builder.

The placeholder stands for the list of parameters that define macros to be used by the makefile. The following table describes these parameters:

Parameter [Values]
    Description

interface
    Defines which programming interface to use. Possible values:
    • For the IA-32 architecture, {cdecl|stdcall}. The default value is cdecl.
    • For the Intel 64 architecture, {lp64|ilp64}. The default value is lp64.

threading = {parallel|sequential}
    Defines whether to use the Intel MKL in the threaded or sequential mode. The default value is parallel.

export =
    Specifies the full name of the file that contains the list of entry-point functions to be included in the DLL. The default name is user_example_list (no extension).

name =
    Specifies the name of the dll and interface library to be created. By default, the names of the created libraries are mkl_custom.dll and mkl_custom.lib.

xerbla =
    Specifies the name of the object file .obj that contains the user's error handler. The makefile adds this error handler to the library for use instead of the default Intel MKL error handler xerbla. If you omit this parameter, the native Intel MKL xerbla is used. See the description of the xerbla function in the Intel MKL Reference Manual on how to develop your own error handler. For the IA-32 architecture, the object file should be in the interface defined by the interface macro (cdecl or stdcall).

MKLROOT =
    Specifies the location of Intel MKL libraries used to build the custom DLL. By default, the builder uses the Intel MKL installation directory.

buf_lib
    Manages resolution of the __security_cookie external references in the custom DLL on systems based on the Intel® 64 architecture. By default, the makefile uses the bufferoverflowu.lib library of Microsoft SDK builds 1289 or higher. This library resolves the __security_cookie external references. To avoid using this library, set the empty value of this parameter. Therefore, if you are using an older SDK, set buf_lib= .
    CAUTION Use the buf_lib parameter only with the empty value. Incorrect value of the parameter causes builder errors.

crt =
    Specifies the name of the Microsoft C run-time library to be used to build the custom DLL. By default, the builder uses msvcrt.lib.

manifest = {yes|no|embed}
    Manages the creation of a Microsoft manifest for the custom DLL:
    • If manifest=yes, the manifest file with the name defined by the name parameter above and the manifest extension will be created.
    • If manifest=no, the manifest file will not be created.
    • If manifest=embed, the manifest will be embedded into the DLL.
    By default, the builder does not use the manifest parameter.

All the above parameters are optional.

In the simplest case, the command line is nmake ia32, and the missing options have default values. This command creates the mkl_custom.dll and mkl_custom.lib libraries with the cdecl interface for processors using the IA-32 architecture. The command takes the list of functions from the functions_list file and uses the native Intel MKL error handler xerbla.

An example of a more complex case follows:

nmake ia32 interface=stdcall export=my_func_list.txt name=mkl_small xerbla=my_xerbla.obj

In this case, the command creates the mkl_small.dll and mkl_small.lib libraries with the stdcall interface for processors using the IA-32 architecture. The command takes the list of functions from the my_func_list.txt file and uses the user's error handler my_xerbla.obj.

The process is similar for processors using the Intel® 64 architecture.
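The builder command line described above is just a target followed by name=value macros, so it can be assembled and sanity-checked before it is run. The following Python sketch is one way to do that. The function nmake_command and the constants around it are hypothetical and not shipped with Intel MKL. The sketch uses the target names from the target table (libia32, libintel64, dllia32, dllintel64), whereas the worked examples in the text spell the target as ia32, and it only builds the string without invoking nmake.

```python
# Hypothetical helper that assembles a custom DLL builder command line.
# Target and parameter names come from the tables above; nothing here calls nmake.
TARGET_ARCH = {"libia32": "ia32", "libintel64": "intel64",
               "dllia32": "ia32", "dllintel64": "intel64"}
INTERFACES = {"ia32": ("cdecl", "stdcall"), "intel64": ("lp64", "ilp64")}
PARAMETERS = ("interface", "threading", "export", "name", "xerbla",
              "MKLROOT", "buf_lib", "crt", "manifest")


def nmake_command(target, **params):
    """Build 'nmake <target> key=value ...' after basic consistency checks."""
    if target not in TARGET_ARCH:
        raise ValueError("unknown target: " + target)
    unknown = sorted(set(params) - set(PARAMETERS))
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(unknown))
    arch = TARGET_ARCH[target]
    iface = params.get("interface")
    if iface is not None and iface not in INTERFACES[arch]:
        raise ValueError(iface + " is not an interface for " + arch)
    parts = ["nmake", target]
    for key in PARAMETERS:  # emit macros in the order the guide lists them
        if key in params:
            parts.append(key + "=" + str(params[key]))
    return " ".join(parts)


print(nmake_command("libia32", interface="stdcall", export="my_func_list.txt",
                    name="mkl_small", xerbla="my_xerbla.obj"))
```

Checking the interface against the architecture catches the easiest mistake to make here, which is passing lp64 or ilp64 to an IA-32 target or cdecl or stdcall to an Intel® 64 target.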
See Also
Linking with System Libraries

Composing a List of Functions

To compose a list of functions for a minimal custom DLL needed for your application, you can use the following procedure:
1. Link your application with installed Intel MKL libraries to make sure the application builds.
2. Remove all Intel MKL libraries from the link line and start linking. Unresolved symbols indicate Intel MKL functions that your application uses.
3. Include these functions in the list.

Important Each time your application starts using more Intel MKL functions, update the list to include the new functions.

See Also
Specifying Function Names

Specifying Function Names

In the file with the list of functions for your custom DLL, adjust function names to the required interface. For example, you can list the cdecl entry points as follows:

DGEMM
DTRSM
DDOT
DGETRF
DGETRS
cblas_dgemm
cblas_ddot

You can list the stdcall entry points as follows:

_DGEMM@60
_DDOT@20
_DGETRF@24

For more examples, see domain-specific lists of function names in the \tools\builder folder. This folder contains lists of function names for both cdecl and stdcall interfaces.

NOTE The lists of function names are provided in the \tools\builder folder merely as examples. See Composing a List of Functions for how to compose lists of functions for your custom DLL.

TIP Names of Fortran-style routines (BLAS, LAPACK, etc.) can be either upper-case or lower-case, with or without the trailing underscore. For example, these names are equivalent:
BLAS: dgemm, DGEMM, dgemm_, DGEMM_
LAPACK: dgetrf, DGETRF, dgetrf_, DGETRF_.

Properly capitalize names of C support functions in the function list. To do this, follow the guidelines below:
1. In the mkl_service.h include file, look up a #define directive for your function.
2. Take the function name from the replacement part of that directive.
For example, the #define directive for the mkl_disable_fast_mm function is
#define mkl_disable_fast_mm MKL_Disable_Fast_MM.
Capitalize the name of this function in the list like this: MKL_Disable_Fast_MM.

For the names of the Fortran support functions, see the tip.

Building a Custom Dynamic-link Library in the Visual Studio* Development System

You can build a custom dynamic-link library (DLL) in the Microsoft Visual Studio* Development System (VS*). To do this, use projects available in the tools\builder\MSVS_Projects subdirectory of the Intel MKL directory. The directory contains the VS2005, VS2008, and VS2010 subdirectories with projects for the respective versions of the Visual Studio Development System. For each version of VS two solutions are available:
• libia32.sln builds a custom DLL for the IA-32 architecture.
• libintel64.sln builds a custom DLL for the Intel® 64 architecture.

The builder uses the following default settings for the custom DLL:

Interface:                   cdecl for the IA-32 architecture and LP64 for the Intel 64 architecture
Error handler:               Native Intel MKL xerbla
Create Microsoft manifest:   yes
List of functions:           in the project's source file examples.def

To build a custom DLL:
1. Open the libia32.sln or libintel64.sln solution depending on the architecture of your system.
   The solution includes the following projects:
   • i_malloc_dll
   • vml_dll_core
   • cdecl_parallel (in libia32.sln) or lp64_parallel (in libintel64.sln)
   • cdecl_sequential (in libia32.sln) or lp64_sequential (in libintel64.sln)
2. [Optional] To change any of the default settings, select the project depending on whether the DLL will use Intel MKL functions in the sequential or multi-threaded mode:
   • In the libia32 solution, select the cdecl_sequential or cdecl_parallel project.
   • In the libintel64 solution, select the lp64_sequential or lp64_parallel project.
3. [Optional] To build the DLL that uses the stdcall interface for the IA-32 architecture or the ILP64 interface for the Intel 64 architecture:
   a. Select Project>Properties>Configuration Properties>Linker>Input>Additional Dependencies.
   b. In the libia32 solution, change mkl_intel_c.lib to mkl_intel_s.lib. In the libintel64 solution, change mkl_intel_lp64.lib to mkl_intel_ilp64.lib.
4. [Optional] To include your own error handler in the DLL:
   a. Select Project>Properties>Configuration Properties>Linker>Input.
   b. Add .obj
5. [Optional] To turn off creation of the manifest:
   a. Select Project>Properties>Configuration Properties>Linker>Manifest File>Generate Manifest.
   b. Select: no.
6. [Optional] To change the list of functions to be included in the DLL:
   a. Select Source Files.
   b. Edit the examples.def file. Refer to Specifying Function Names for how to specify entry points.
7. To build the library:
   • In VS2005 - VS2008, select Build>Project Only>Link Only and link projects in this order: i_malloc_dll, vml_dll_core, cdecl_sequential/lp64_sequential or cdecl_parallel/lp64_parallel.
   • In VS2010, select Build>Build Solution.

See Also
Using the Custom Dynamic-link Library Builder in the Command-line Mode

Distributing Your Custom Dynamic-link Library

To enable use of your custom DLL in a threaded mode, distribute libiomp5md.dll along with the custom DLL.

Managing Performance and Memory

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Using Parallelism of the Intel® Math Kernel Library

Intel MKL is extensively parallelized. See Threaded Functions and Problems for lists of threaded functions and problems that can be threaded.

Intel MKL is thread-safe, which means that all Intel MKL functions (except the LAPACK deprecated routine ?lacon) work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access for multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Therefore, you can call Intel MKL from multiple threads and not worry about the function instances interfering with each other.

The library uses OpenMP* threading software, so you can use the environment variable OMP_NUM_THREADS to specify the number of threads or the equivalent OpenMP run-time function calls. Intel MKL also offers variables that are independent of OpenMP, such as MKL_NUM_THREADS, and equivalent Intel MKL functions for thread management. The Intel MKL variables are always inspected first, then the OpenMP variables are examined, and if neither is used, the OpenMP software chooses the default number of threads.

By default, Intel MKL uses the number of threads equal to the number of physical cores on the system.

To achieve higher performance, set the number of threads to the number of real processors or physical cores, as summarized in Techniques to Set the Number of Threads.

See Also
Managing Multi-core Performance

Threaded Functions and Problems

The following Intel MKL function domains are threaded:
• Direct sparse solver.
• LAPACK. For the list of threaded routines, see Threaded LAPACK Routines.
• Level1 and Level2 BLAS. For the list of threaded routines, see Threaded BLAS Level1 and Level2 Routines.
• All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
• All mathematical VML functions.
• FFT. For the list of FFT transforms that can be threaded, see Threaded FFT Problems.

Threaded LAPACK Routines
In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.
The following LAPACK routines are threaded:
• Linear equations, computational routines:
  • Factorization: ?getrf, ?gbtrf, ?potrf, ?pptrf, ?sytrf, ?hetrf, ?sptrf, ?hptrf
  • Solving: ?dttrsb, ?gbtrs, ?gttrs, ?pptrs, ?pbtrs, ?pttrs, ?sytrs, ?sptrs, ?hptrs, ?tptrs, ?tbtrs
• Orthogonal factorization, computational routines: ?geqrf, ?ormqr, ?unmqr, ?ormlq, ?unmlq, ?ormql, ?unmql, ?ormrq, ?unmrq
• Singular Value Decomposition, computational routines: ?gebrd, ?bdsqr
• Symmetric Eigenvalue Problems, computational routines: ?sytrd, ?hetrd, ?sptrd, ?hptrd, ?steqr, ?stedc.
• Generalized Nonsymmetric Eigenvalue Problems, computational routines: chgeqz/zhgeqz.
A number of other LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective use of parallelism: ?gesv, ?posv, ?gels, ?gesvd, ?syev, ?heev, cgegs/zgegs, cgegv/zgegv, cgges/zgges, cggesx/zggesx, cggev/zggev, cggevx/zggevx, and so on.

Threaded BLAS Level1 and Level2 Routines
In the following list, ? stands for a precision prefix of each flavor of the respective routine and may have the value of s, d, c, or z.
The following routines are threaded for Intel® Core™2 Duo and Intel® Core™ i7 processors:
• Level1 BLAS: ?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level2 BLAS: ?gemv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv

Threaded FFT Problems
The following characteristics of a specific problem determine whether your FFT computation may be threaded:
• rank
• domain
• size/length
• precision (single or double)
• placement (in-place or out-of-place)
• strides
• number of transforms
• layout (for example, interleaved or split layout of complex data)
Most FFT problems are threaded. In particular, computation of multiple transforms in one call (number of transforms > 1) is threaded. Details of which transforms are threaded follow.

One-dimensional (1D) transforms
1D transforms are threaded in many cases.
1D complex-to-complex (c2c) transforms of size N using interleaved complex data layout are threaded under the following conditions depending on the architecture:

Architecture    Conditions
Intel® 64       N is a power of 2, log2(N) > 9, the transform is double-precision out-of-place, and input/output strides equal 1.
IA-32           N is a power of 2, log2(N) > 13, and the transform is single-precision.
                N is a power of 2, log2(N) > 14, and the transform is double-precision.
Any             N is composite, log2(N) > 16, and input/output strides equal 1.

1D real-to-complex and complex-to-real transforms are not threaded.
1D complex-to-complex transforms using split-complex layout are not threaded.
Prime-size complex-to-complex 1D transforms are not threaded.

Multidimensional transforms
All multidimensional transforms on large-volume data are threaded.

Avoiding Conflicts in the Execution Environment
Certain situations can cause conflicts in the execution environment that make the use of threads in Intel MKL problematic. This section briefly discusses why these problems exist and how to avoid them.
If you thread the program using OpenMP directives and compile the program with Intel compilers, Intel MKL and the program will both use the same threading library. Intel MKL tries to determine if it is in a parallel region in the program, and if it is, it does not spread its operations over multiple threads unless you specifically request Intel MKL to do so via the MKL_DYNAMIC functionality. However, Intel MKL can be aware that it is in a parallel region only if the threaded program and Intel MKL are using the same threading library. If your program is threaded by some other means, Intel MKL may operate in multithreaded mode, and the performance may suffer due to overuse of the resources.

The following table considers several cases where the conflicts may arise and provides recommendations depending on your threading model:

Threading model: You thread the program using OS threads (Win32* threads on Windows* OS).
Discussion: If more than one thread calls Intel MKL, and the function being called is threaded, it may be important that you turn off Intel MKL threading. Set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

Threading model: You thread the program using OpenMP directives and/or pragmas and compile the program using a compiler other than a compiler from Intel.
Discussion: This is more problematic because setting of the OMP_NUM_THREADS environment variable affects both the compiler's threading library and libiomp. In this case, choose the threading library that matches the layered Intel MKL with the OpenMP compiler you employ (see Linking Examples on how to do this). If this is not possible, use Intel MKL in the sequential mode. To do this, you should link with the appropriate threading library: mkl_sequential.lib or mkl_sequential.dll (see High-level Directory Structure).
Threading model: There are multiple programs running on a multiple-CPU system, for example, a parallelized program that runs using MPI for communication in which each processor is treated as a node.
Discussion: The threading software will see multiple processors on the system even though each processor has a separate MPI process running on it. In this case, one of the solutions is to set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads). Section Intel(R) Optimized MP LINPACK Benchmark for Clusters discusses another solution for a Hybrid (OpenMP* + MPI) mode.

TIP To get best performance with threaded Intel MKL, compile your code with the /MT option.

See Also
Using Additional Threading Control
Linking with Compiler Run-time Libraries

Techniques to Set the Number of Threads
Use one of the following techniques to change the number of threads to use in Intel MKL:
• Set one of the OpenMP or Intel MKL environment variables:
  • OMP_NUM_THREADS
  • MKL_NUM_THREADS
  • MKL_DOMAIN_NUM_THREADS
• Call one of the OpenMP or Intel MKL functions:
  • omp_set_num_threads()
  • mkl_set_num_threads()
  • mkl_domain_set_num_threads()
When choosing the appropriate technique, take into account the following rules:
• The Intel MKL threading controls take precedence over the OpenMP controls because they are inspected first.
• A function call takes precedence over any environment variables. The exception, which is a consequence of the previous rule, is the OpenMP subroutine omp_set_num_threads(), which does not have precedence over Intel MKL environment variables, such as MKL_NUM_THREADS. See Using Additional Threading Control for more details.
• You cannot change run-time behavior in the course of the run using the environment variables because they are read only once at the first call to Intel MKL.
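The precedence rules above can be sketched as a small lookup. This is only an illustrative model and not Intel MKL code: the struct thread_settings and the function effective_threads() are hypothetical names introduced here, a value of 0 is assumed to mean "this control was not used", and domain-specific settings (MKL_DOMAIN_NUM_THREADS) as well as any later adjustment by MKL_DYNAMIC are deliberately left out.

```c
#include <assert.h>

/* Hypothetical model of the precedence rules listed above.
   A field value of 0 means "this control was not used". */
struct thread_settings {
    int mkl_call;  /* value passed to mkl_set_num_threads() */
    int mkl_env;   /* value of MKL_NUM_THREADS              */
    int omp_call;  /* value passed to omp_set_num_threads() */
    int omp_env;   /* value of OMP_NUM_THREADS              */
};

/* Returns the thread count that would be suggested to the library,
   or default_threads when none of the controls is set. */
int effective_threads(struct thread_settings s, int default_threads)
{
    assert(default_threads > 0);
    if (s.mkl_call > 0) return s.mkl_call; /* MKL function call wins          */
    if (s.mkl_env  > 0) return s.mkl_env;  /* MKL variable beats OpenMP call  */
    if (s.omp_call > 0) return s.omp_call; /* OpenMP call beats OpenMP var    */
    if (s.omp_env  > 0) return s.omp_env;
    return default_threads;                /* OpenMP software picks a default */
}
```

In this model, a program that calls omp_set_num_threads(2) while MKL_NUM_THREADS=4 is set in the environment ends up with four threads suggested to the library, which mirrors the exception described in the second rule.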
Setting the Number of Threads Using an OpenMP* Environment Variable
You can set the number of threads using the environment variable OMP_NUM_THREADS. To change the number of threads, in the command shell in which the program is going to run, enter:
set OMP_NUM_THREADS=<number of threads to use>
Some shells require the variable and its value to be exported:
export OMP_NUM_THREADS=<number of threads to use>
You can alternatively assign a value to the environment variable using the Microsoft Windows* OS Control Panel.
Note that you will not benefit from setting this variable on Microsoft Windows* 98 or Windows* ME because multiprocessing is not supported.

See Also
Using Additional Threading Control

Changing the Number of Threads at Run Time
You cannot change the number of threads during run time using environment variables. However, you can call OpenMP API functions from your program to change the number of threads during run time. The following sample code shows how to change the number of threads during run time using the omp_set_num_threads() routine. See also Techniques to Set the Number of Threads.

The example is given in both C and Fortran. To run this example in the C language, use the omp.h header file from the Intel(R) compiler package. If you do not have the Intel compiler but wish to explore the functionality in the example, use the Fortran API for omp_set_num_threads() rather than the C version. For example, omp_set_num_threads_( &i_one );

// ******* C language *******
#include "omp.h"
#include "mkl.h"
#include <stdlib.h>   /* malloc */
#define SIZE 1000
int main(int args, char *argv[]){
    double *a, *b, *c;
    a = (double*)malloc(sizeof(double)*SIZE*SIZE);
    b = (double*)malloc(sizeof(double)*SIZE*SIZE);
    c = (double*)malloc(sizeof(double)*SIZE*SIZE);
    double alpha=1, beta=1;
    int m=SIZE, n=SIZE, k=SIZE, lda=SIZE, ldb=SIZE, ldc=SIZE, i=0, j=0;
    char transa='n', transb='n';
    for( i=0; i ...

#include ...
mkl_set_num_threads ( 1 );
// ******* Fortran language *******
...
call mkl_set_num_threads( 1 )

See the Intel MKL Reference Manual for the detailed description of the threading control functions, their parameters, calling syntax, and more code examples.

MKL_DYNAMIC
The MKL_DYNAMIC environment variable enables Intel MKL to dynamically change the number of threads.
The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.
When MKL_DYNAMIC is TRUE, Intel MKL tries to use what it considers the best number of threads, up to the maximum number you specify.
For example, MKL_DYNAMIC set to TRUE enables optimal choice of the number of threads in the following cases:
• If the requested number of threads exceeds the number of physical cores (perhaps because of using the Intel® Hyper-Threading Technology), and MKL_DYNAMIC is not changed from its default value of TRUE, Intel MKL will scale down the number of threads to the number of physical cores.
• If you are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.
When MKL_DYNAMIC is FALSE, Intel MKL tries not to deviate from the number of threads the user requested. However, setting MKL_DYNAMIC=FALSE does not ensure that Intel MKL will use the number of threads that you request. The library may have no choice on this number for such reasons as system resources. Additionally, the library may examine the problem and use a different number of threads than the value suggested. For example, if you attempt to do a size one matrix-matrix multiply across eight threads, the library may instead choose to use only one thread because it is impractical to use eight threads in this event.
Note also that if Intel MKL is called in a parallel region, it will use only one thread by default.
If you want the library to use nested parallelism, and the thread within a parallel region is compiled with the same OpenMP compiler as Intel MKL is using, you may experiment with setting MKL_DYNAMIC to FALSE and manually increasing the number of threads.
In general, set MKL_DYNAMIC to FALSE only under circumstances that Intel MKL is unable to detect, for example, to use nested parallelism where the library is already called from a parallel section.

MKL_DOMAIN_NUM_THREADS
The MKL_DOMAIN_NUM_THREADS environment variable suggests the number of threads for a particular function domain.
MKL_DOMAIN_NUM_THREADS accepts a string value <MKL-env-string>, which must have the following format:
<MKL-env-string> ::= <MKL-domain-env-string> { <delimiter> <MKL-domain-env-string> }
<delimiter> ::= [ <space-symbol>* ] ( <space-symbol> | <comma-symbol> | <semicolon-symbol> | <colon-symbol> ) [ <space-symbol>* ]
<MKL-domain-env-string> ::= <MKL-domain-env-name> <uses> <number-of-threads>
<MKL-domain-env-name> ::= MKL_DOMAIN_ALL | MKL_DOMAIN_BLAS | MKL_DOMAIN_FFT | MKL_DOMAIN_VML | MKL_DOMAIN_PARDISO
<uses> ::= [ <space-symbol>* ] ( <space-symbol> | <equality-sign> | <comma-symbol> ) [ <space-symbol>* ]
<number-of-threads> ::= <positive-number>
<positive-number> ::= <decimal-positive-number> | <octal-number> | <hexadecimal-number>
In the syntax above, values of <MKL-domain-env-name> indicate function domains as follows:
MKL_DOMAIN_ALL        All function domains
MKL_DOMAIN_BLAS       BLAS Routines
MKL_DOMAIN_FFT        non-cluster Fourier Transform Functions
MKL_DOMAIN_VML        Vector Mathematical Functions
MKL_DOMAIN_PARDISO    PARDISO
For example,
MKL_DOMAIN_ALL 2 : MKL_DOMAIN_BLAS 1 : MKL_DOMAIN_FFT 4
MKL_DOMAIN_ALL=2 : MKL_DOMAIN_BLAS=1 : MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL=2, MKL_DOMAIN_BLAS=1, MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL=2; MKL_DOMAIN_BLAS=1; MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL = 2 MKL_DOMAIN_BLAS 1 , MKL_DOMAIN_FFT 4
MKL_DOMAIN_ALL,2: MKL_DOMAIN_BLAS 1, MKL_DOMAIN_FFT,4 .
The global variables MKL_DOMAIN_ALL, MKL_DOMAIN_BLAS, MKL_DOMAIN_FFT, MKL_DOMAIN_VML, and MKL_DOMAIN_PARDISO, as well as the interface for the Intel MKL threading control functions, can be found in the mkl.h header file.
The table below illustrates how values of MKL_DOMAIN_NUM_THREADS are interpreted.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_ALL=4
Interpretation: All parts of Intel MKL should try four threads.
The actual number of threads may still be different because of the MKL_DYNAMIC setting or system resource issues. The setting is equivalent to MKL_NUM_THREADS = 4.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4
Interpretation: All parts of Intel MKL should try one thread, except for BLAS, which is suggested to try four threads.

Value of MKL_DOMAIN_NUM_THREADS: MKL_DOMAIN_VML=2
Interpretation: VML should try two threads. The setting affects no other part of Intel MKL.

Be aware that the domain-specific settings take precedence over the overall ones. For example, the "MKL_DOMAIN_BLAS=4" value of MKL_DOMAIN_NUM_THREADS suggests trying four threads for BLAS, regardless of a later setting of MKL_NUM_THREADS, and a function call "mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );" suggests the same, regardless of later calls to mkl_set_num_threads().
However, a function call with input "MKL_DOMAIN_ALL", such as "mkl_domain_set_num_threads (4, MKL_DOMAIN_ALL);" is equivalent to "mkl_set_num_threads(4)", and thus it will be overwritten by later calls to mkl_set_num_threads. Similarly, the environment setting of MKL_DOMAIN_NUM_THREADS with "MKL_DOMAIN_ALL=4" will be overwritten with MKL_NUM_THREADS = 2.
Whereas the MKL_DOMAIN_NUM_THREADS environment variable enables you to set several variables at once, for example, "MKL_DOMAIN_BLAS=4,MKL_DOMAIN_FFT=2", the corresponding function does not take string syntax.
So, to do the same with the function calls, you may need to make several calls, which in this example are as follows:
mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );
mkl_domain_set_num_threads ( 2, MKL_DOMAIN_FFT );

Setting the Environment Variables for Threading Control
To set the environment variables used for threading control, in the command shell in which the program is going to run, enter:
set <VARIABLE NAME>=<value>
For example,
set MKL_NUM_THREADS=4
set MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
set MKL_DYNAMIC=FALSE
Some shells require the variable and its value to be exported:
export <VARIABLE NAME>=<value>
For example:
export MKL_NUM_THREADS=4
export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4"
export MKL_DYNAMIC=FALSE
You can alternatively assign values to the environment variables using the Microsoft Windows* OS Control Panel.

Tips and Techniques to Improve Performance

Coding Techniques
To obtain the best performance with Intel MKL, ensure the following data alignment in your source code:
• Align arrays on 16-byte boundaries. See Aligning Addresses on 16-byte Boundaries for how to do it.
• Make sure leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16, where element_size is the size of an array element in bytes.
• For two-dimensional arrays, avoid leading dimension values divisible by 2048 bytes. For example, for a double-precision array, with element_size = 8, avoid leading dimensions 256, 512, 768, 1024, … (elements).

LAPACK Packed Routines
The routines with the names that contain the letters HP, OP, PP, SP, TP, UP in the matrix type and storage position (the second and third letters respectively) operate on the matrices in the packed format (see LAPACK "Routine Naming Conventions" sections in the Intel MKL Reference Manual).
Their functionality is strictly equivalent to the functionality of the unpacked routines with the names containing the letters HE, OR, PO, SY, TR, UN in the same positions, but the performance is significantly lower.
If the memory restriction is not too tight, use an unpacked routine for better performance. In this case, you need to allocate N^2/2 more memory than the memory required by a respective packed routine, where N is the problem size (the number of equations).
For example, to speed up solving a symmetric eigenproblem with an expert driver, use the unpacked routine:
call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info)
where a has dimension lda-by-n, which is at least N^2 elements, instead of the packed routine:
call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info)
where ap has dimension N*(N+1)/2.

FFT Functions
Additional conditions can improve performance of the FFT functions.
The addresses of the first elements of arrays and the leading dimension values, in bytes (n*element_size), of two-dimensional arrays should be divisible by cache line size, which equals:
• 32 bytes for the Intel® Pentium® III processors
• 64 bytes for the Intel® Pentium® 4 processors and processors using Intel® 64 architecture

Hardware Configuration Tips
Dual-Core Intel® Xeon® processor 5100 series systems
To get the best performance with Intel MKL on Dual-Core Intel® Xeon® processor 5100 series systems, enable the Hardware DPL (streaming data) Prefetcher functionality of this processor. To configure this functionality, use the appropriate BIOS settings, as described in your BIOS documentation.
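As an illustration of the leading-dimension advice given under Coding Techniques and FFT Functions, the following minimal sketch picks a padded leading dimension. The helper padded_leading_dim() is a hypothetical name introduced here, not an Intel MKL function. It assumes the alignment (for example, 16 bytes or a 64-byte cache line) is a multiple of the element size and smaller than 2048 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns a leading dimension (in elements) that is
   at least n, whose size in bytes is a multiple of `align`, and whose
   size in bytes is not a multiple of 2048. */
size_t padded_leading_dim(size_t n, size_t elem_size, size_t align)
{
    assert(elem_size > 0 && align > 0);
    assert(align % elem_size == 0 && align < 2048);

    size_t bytes = n * elem_size;
    bytes = (bytes + align - 1) / align * align; /* round up to alignment */
    if (bytes % 2048 == 0)
        bytes += align;          /* step off the 2048-byte multiple */
    return bytes / elem_size;
}
```

With 8-byte elements and a 64-byte alignment, a row length of 1000 is already acceptable and is returned unchanged, whereas 1024 (8192 bytes, a multiple of 2048) is padded to 1032 elements.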
Intel® Hyper-Threading Technology
Intel® Hyper-Threading Technology (Intel® HT Technology) is especially effective when each thread performs different types of operations and when there are under-utilized resources on the processor. However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling Intel HT Technology.
If you run with Intel HT Technology enabled, performance may be especially impacted if you run on fewer threads than physical cores. Moreover, if, for example, there are two threads to every physical core, the thread scheduler may assign two threads to some cores and ignore the other cores altogether. If you are using the OpenMP* library of the Intel Compiler, read the respective User Guide on how to best set the thread affinity interface to avoid this situation. For Intel MKL, apply the following setting:
set KMP_AFFINITY=granularity=fine,compact,1,0

See Also
Using Parallelism of the Intel® Math Kernel Library

Managing Multi-core Performance
You can obtain best performance on systems with multi-core processors by requiring that threads do not migrate from core to core. To do this, bind threads to the CPU cores by setting an affinity mask to threads. Use one of the following options:
• OpenMP facilities (recommended, if available), for example, the KMP_AFFINITY environment variable using the Intel OpenMP library
• A system function, as explained below
Consider the following performance issue:
• The system has two sockets with two cores each, for a total of four cores (CPUs)
• Performance of the four-thread parallel application using the Intel MKL LAPACK is unstable
The following code example shows how to resolve this issue by setting an affinity mask by operating system means using the Intel compiler.
The code calls the system function SetThreadAffinityMask to bind the threads to appropriate cores, thus preventing migration of the threads. Then the Intel MKL LAPACK routine is called:

// Set affinity mask
#include <windows.h>
#include <omp.h>
int main(void) {
    #pragma omp parallel default(shared)
    {
        int tid = omp_get_thread_num();
        // 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
        DWORD_PTR mask = (1 << (tid == 0 ? 0 : 2 ));
        SetThreadAffinityMask( GetCurrentThread(), mask );
    }
    // Call Intel MKL LAPACK routine
    return 0;
}

Compile the application with the Intel compiler using the following command:
icl /Qopenmp test_application.c
where test_application.c is the filename for the application.
Build the application. Run it in four threads, for example, by using the environment variable to set the number of threads:
set OMP_NUM_THREADS=4
test_application.exe
See Windows API documentation at msdn.microsoft.com/ for the restrictions on the usage of Windows API routines and particulars of the SetThreadAffinityMask function used in the above example.
See also a similar example at en.wikipedia.org/wiki/Affinity_mask .

Operating on Denormals
The IEEE 754-2008 standard, "IEEE Standard for Floating-Point Arithmetic", defines denormal (or subnormal) numbers as non-zero numbers smaller than the smallest possible normalized numbers for a specific floating-point format. Floating-point operations on denormals are slower than on normalized operands because denormal operands and results are usually handled through a software assist mechanism rather than directly in hardware. This software processing causes Intel MKL functions that consume denormals to run slower than with normalized floating-point numbers.
You can mitigate this performance issue by setting the appropriate bit fields in the MXCSR floating-point control register to flush denormals to zero (FTZ) or to replace any denormals loaded from memory with zero (DAZ).
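One way to set these MXCSR bits directly from C is through the SSE intrinsics macros that common x86 compilers ship in xmmintrin.h and pmmintrin.h. The sketch below is a minimal illustration under that assumption. The function name enable_ftz_daz() is hypothetical, and the setting applies only to the thread that calls it, because MXCSR is per-thread state.

```c
#include <assert.h>
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE and its getter       */

/* Hypothetical helper: turn on FTZ and DAZ for the calling thread. */
void enable_ftz_daz(void)
{
    /* FTZ: denormal results are flushed to zero. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    /* DAZ: denormal operands are treated as zero. */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    /* Read the modes back as a sanity check. */
    assert(_MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON);
    assert(_MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON);
}
```

Whether threads created later by the threading run-time inherit this setting is not covered by this sketch, so a cautious program would call the helper in each thread that performs the computation.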
Check your compiler documentation to determine whether it has options to control FTZ and DAZ. Note that these compiler options may slightly affect accuracy.

FFT Optimized Radices
You can improve the performance of Intel MKL FFT if the length of your data vector permits factorization into powers of optimized radices.
In Intel MKL, the optimized radices are 2, 3, 5, 7, 11, and 13.

Using Memory Management

Intel MKL Memory Management Software
Intel MKL has memory management software that controls memory buffers for the use by the library functions. New buffers that the library allocates when your application calls Intel MKL are not deallocated until the program ends. To get the amount of memory allocated by the memory management software, call the mkl_mem_stat() function. If your program needs to free memory, call mkl_free_buffers(). If another call is made to a library function that needs a memory buffer, the memory manager again allocates the buffers and they again remain allocated until either the program ends or the program deallocates the memory. This behavior facilitates better performance. However, some tools may report this behavior as a memory leak.
The memory management software is turned on by default. To turn it off, set the MKL_DISABLE_FAST_MM environment variable to any value or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL routines, especially for small problem sizes.

Redefining Memory Functions
In C/C++ programs, you can replace Intel MKL memory functions that the library uses by default with your own functions. To do this, use the memory renaming feature.

Memory Renaming
Intel MKL memory management by default uses standard C run-time memory functions to allocate or free memory. These functions can be replaced using memory renaming.
Intel MKL accesses the memory functions by pointers i_malloc, i_free, i_calloc, and i_realloc, which are visible at the application level. These pointers initially hold addresses of the standard C run-time memory functions malloc, free, calloc, and realloc, respectively. You can programmatically redefine values of these pointers to the addresses of your application's memory management functions.
Redirecting the pointers is the only correct way to use your own set of memory management functions. If you call your own memory functions without redirecting the pointers, the memory will get managed by two independent memory management packages, which may cause unexpected memory issues.

How to Redefine Memory Functions
To redefine memory functions, use the following procedure:
If you are using the statically linked Intel MKL,
1. Include the i_malloc.h header file in your code.
   This header file contains all declarations required for replacing the memory allocation functions. The header file also describes how memory allocation can be replaced in those Intel libraries that support this feature.
2. Redefine values of pointers i_malloc, i_free, i_calloc, and i_realloc prior to the first call to MKL functions, as shown in the following example:
#include "i_malloc.h"
. . .
i_malloc = my_malloc;
i_calloc = my_calloc;
i_realloc = my_realloc;
i_free = my_free;
. . .
// Now you may call Intel MKL functions
If you are using the dynamically linked Intel MKL,
1. Include the i_malloc.h header file in your code.
2. Redefine values of pointers i_malloc_dll, i_free_dll, i_calloc_dll, and i_realloc_dll prior to the first call to MKL functions, as shown in the following example:
#include "i_malloc.h"
. . .
i_malloc_dll = my_malloc;
i_calloc_dll = my_calloc;
i_realloc_dll = my_realloc;
i_free_dll = my_free;
. . .
// Now you may call Intel MKL functions

Language-specific Usage Options
The Intel® Math Kernel Library (Intel® MKL) provides broad support for Fortran and C/C++ programming. However, not all functions support both Fortran and C interfaces. For example, some LAPACK functions have no C interface. You can call such functions from C using mixed-language programming.
If you want to use LAPACK or BLAS functions that support Fortran 77 in the Fortran 95 environment, additional effort may be initially required to build compiler-specific interface libraries and modules from the source code provided with Intel MKL.

Using Language-Specific Interfaces with Intel® Math Kernel Library
This section discusses mixed-language programming and the use of language-specific interfaces with Intel MKL.
See also Appendix G in the Intel MKL Reference Manual for details of the FFTW interfaces to Intel MKL.

Interface Libraries and Modules
You can create the following interface libraries and modules using the respective makefiles located in the interfaces directory.
File name: Contains

Libraries, in Intel MKL architecture-specific directories
mkl_blas95.lib ¹: Fortran 95 wrappers for BLAS (BLAS95) for IA-32 architecture.
mkl_blas95_ilp64.lib ¹: Fortran 95 wrappers for BLAS (BLAS95) supporting ILP64 interface.
mkl_blas95_lp64.lib ¹: Fortran 95 wrappers for BLAS (BLAS95) supporting LP64 interface.
mkl_lapack95.lib ¹: Fortran 95 wrappers for LAPACK (LAPACK95) for IA-32 architecture.
mkl_lapack95_lp64.lib ¹: Fortran 95 wrappers for LAPACK (LAPACK95) supporting LP64 interface.
mkl_lapack95_ilp64.lib ¹: Fortran 95 wrappers for LAPACK (LAPACK95) supporting ILP64 interface.
fftw2xc_intel.lib ¹: Interfaces for FFTW version 2.x (C interface for Intel compilers) to call Intel MKL FFTs.
fftw2xc_ms.lib: Interfaces for FFTW version 2.x (C interface for Microsoft compilers) to call Intel MKL FFTs.
fftw2xf_intel.lib: Interfaces for FFTW version 2.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
fftw3xc_intel.lib ²: Interfaces for FFTW version 3.x (C interface for Intel compiler) to call Intel MKL FFTs.
fftw3xc_ms.lib: Interfaces for FFTW version 3.x (C interface for Microsoft compilers) to call Intel MKL FFTs.
fftw3xf_intel.lib ²: Interfaces for FFTW version 3.x (Fortran interface for Intel compilers) to call Intel MKL FFTs.
fftw2x_cdft_SINGLE.lib: Single-precision interfaces for MPI FFTW version 2.x (C interface) to call Intel MKL cluster FFTs.
fftw2x_cdft_DOUBLE.lib: Double-precision interfaces for MPI FFTW version 2.x (C interface) to call Intel MKL cluster FFTs.
fftw3x_cdft.lib: Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFTs.
fftw3x_cdft_ilp64.lib: Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFTs supporting the ILP64 interface.

Modules, in architecture- and interface-specific subdirectories of the Intel MKL include directory
blas95.mod ¹: Fortran 95 interface module for BLAS (BLAS95).
lapack95.mod ¹: Fortran 95 interface module for LAPACK (LAPACK95).
f95_precision.mod ¹: Fortran 95 definition of precision parameters for BLAS95 and LAPACK95.
mkl95_blas.mod ¹: Fortran 95 interface module for BLAS (BLAS95), identical to blas95.mod. To be removed in one of the future releases.
mkl95_lapack.mod ¹: Fortran 95 interface module for LAPACK (LAPACK95), identical to lapack95.mod. To be removed in one of the future releases.
mkl95_precision.mod ¹: Fortran 95 definition of precision parameters for BLAS95 and LAPACK95, identical to f95_precision.mod. To be removed in one of the future releases.
mkl_service.mod ¹: Fortran 95 interface module for Intel MKL support functions.

¹ Prebuilt for the Intel® Fortran compiler
² FFTW3 interfaces are integrated with Intel MKL. Look into \interfaces\fftw3x* \makefile for options defining how to build and where to place the standalone library with the wrappers.

See Also
Fortran 95 Interfaces to LAPACK and BLAS

Fortran 95 Interfaces to LAPACK and BLAS
Fortran 95 interfaces are compiler-dependent. Intel MKL provides the interface libraries and modules precompiled with the Intel® Fortran compiler. Additionally, the Fortran 95 interfaces and wrappers are delivered as sources. (For more information, see Compiler-dependent Functions and Fortran 90 Modules). If you are using a different compiler, build the appropriate library and modules with your compiler and link the library as a user's library:
1. Go to the respective directory \interfaces\blas95 or \interfaces\lapack95
2. Type one of the following commands depending on your architecture:
   • For the IA-32 architecture,
     nmake libia32 install_dir=
   • For the Intel® 64 architecture,
     nmake libintel64 [interface=lp64|ilp64] install_dir=
Important The parameter install_dir is required.
As a result, the required library is built and installed in the \lib directory, and the .mod files are built and installed in the \include\[\{lp64|ilp64}] directory, where is one of {ia32, intel64}.
By default, the ifort compiler is assumed. You may change the compiler with an additional parameter of nmake: FC=.
For example, the command
nmake libintel64 FC=f95 install_dir= interface=lp64
builds the required library and .mod files and installs them in subdirectories of .
To delete the library from the building directory, use one of the following commands:
• For the IA-32 architecture,
  nmake cleania32 install_dir=
• For the Intel® 64 architecture,
  nmake cleanintel64 [interface=lp64|ilp64] install_dir=
• For all the architectures,
  nmake clean install_dir=
CAUTION Even if you have administrative rights, avoid setting install_dir=..\.. or install_dir= in a build or clean command above because these settings replace or delete the Intel MKL prebuilt Fortran 95 library and modules.

Compiler-dependent Functions and Fortran 90 Modules
Compiler-dependent functions occur whenever the compiler inserts into the object code function calls that are resolved in its run-time library (RTL). Linking of such code without the appropriate RTL will result in undefined symbols. Intel MKL has been designed to minimize RTL dependencies.
In cases where RTL dependencies might arise, the functions are delivered as source code and you need to compile the code with whatever compiler you are using for your application.
In particular, Fortran 90 modules result in the compiler-specific code generation requiring RTL support. Therefore, Intel MKL delivers these modules compiled with the Intel compiler, along with source code, to be used with different compilers.
Using the stdcall Calling Convention in C/C++

Intel MKL supports stdcall calling convention for the following function domains:
• BLAS Routines
• Sparse BLAS Routines
• LAPACK Routines
• Vector Mathematical Functions
• Vector Statistical Functions
• PARDISO
• Direct Sparse Solvers
• RCI Iterative Solvers
• Support Functions

To use the stdcall calling convention in C/C++, follow the guidelines below:
• In your function calls, pass lengths of character strings to the functions. For example, compare the following calls to dgemm:

    cdecl:   dgemm("N", "N", &n, &m, &k, &alpha, b, &ldb, a, &lda, &beta, c, &ldc);
    stdcall: dgemm("N", 1, "N", 1, &n, &m, &k, &alpha, b, &ldb, a, &lda, &beta, c, &ldc);

• Define the MKL_STDCALL macro using either of the following techniques:
  - Define the macro in your source code before including Intel MKL header files:

        ...
        #define MKL_STDCALL
        #include "mkl.h"
        ...

  - Pass the macro to the compiler. For example:

        icl -DMKL_STDCALL foo.c

• Link your application with the following library:
  - mkl_intel_s.lib for static linking
  - mkl_intel_s_dll.lib for dynamic linking

See Also
Using the cdecl and stdcall Interfaces
Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions
Include Files

Compiling an Application that Calls the Intel® Math Kernel Library and Uses the CVF Calling Conventions

The IA-32 architecture implementation of Intel MKL supports the Compaq Visual Fortran* (CVF) calling convention by providing the stdcall interface.

Although Intel MKL does not provide the CVF interface in its Intel® 64 architecture implementation, you can use the Intel® Visual Fortran Compiler to compile your Intel® 64 architecture application that calls Intel MKL and uses the CVF calling convention.
To do this:
• Provide the following compiler options to enable compatibility with the CVF calling convention: /Gm or /iface:cvf
• Additionally provide the following options to enable calling Intel MKL from your application: /iface:nomixed_str_len_arg

See Also
Using the cdecl and stdcall Interfaces
Compiler Support

Mixed-language Programming with the Intel Math Kernel Library

Appendix A: Intel(R) Math Kernel Library Language Interfaces Support lists the programming languages supported for each Intel MKL function domain. However, you can call Intel MKL routines from different language environments.

Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments

Not all Intel MKL function domains support both C and Fortran environments. To use Intel MKL Fortran-style functions in C/C++ environments, you should observe certain conventions, which are discussed for LAPACK and BLAS in the subsections below.

CAUTION
Avoid calling BLAS 95/LAPACK 95 from C/C++. Such calls require skills in manipulating the descriptor of a deferred-shape array, which is the Fortran 90 type. Moreover, BLAS95/LAPACK95 routines contain links to a Fortran RTL.

LAPACK and BLAS

Because LAPACK and BLAS routines are Fortran-style, when calling them from C-language programs, follow the Fortran-style calling conventions:
• Pass variables by address, not by value. Function calls in Example "Calling a Complex BLAS Level 1 Function from C++" and Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrate this.
• Store your data in Fortran style, that is, column-major rather than row-major order. With row-major order, adopted in C, the last array index changes most quickly and the first one changes most slowly when traversing the memory segment where the array is stored. With Fortran-style column-major order, the last index changes most slowly whereas the first index changes most quickly (as illustrated by the figure below for a two-dimensional array).
For example, if a two-dimensional matrix A of size m x n is stored densely in a one-dimensional array B, you can access a matrix element like this:

    A[i][j] = B[i*n+j]       in C       ( i=0, ... , m-1, j=0, ... , n-1 )
    A(i,j)  = B((j-1)*m+i)   in Fortran ( i=1, ... , m, j=1, ... , n )

When calling LAPACK or BLAS routines from C, be aware that because the Fortran language is case-insensitive, the routine names can be either upper-case or lower-case, with or without the trailing underscore. For example, the following names are equivalent:
• LAPACK: dgetrf, DGETRF, dgetrf_, and DGETRF_
• BLAS: dgemm, DGEMM, dgemm_, and DGEMM_

See Example "Calling a Complex BLAS Level 1 Function from C++" on how to call BLAS routines from C. See also the Intel(R) MKL Reference Manual for a description of the C interface to LAPACK functions.

CBLAS

Instead of calling BLAS routines from a C-language program, you can use the CBLAS interface. CBLAS is a C-style interface to the BLAS routines. You can call CBLAS routines using regular C-style calls. Use the mkl.h header file with the CBLAS interface. The header file specifies enumerated values and prototypes of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. Example "Using CBLAS Interface Instead of Calling BLAS Directly from C" illustrates the use of the CBLAS interface.

C Interface to LAPACK

Instead of calling LAPACK routines from a C-language program, you can use the C interface to LAPACK provided by Intel MKL. The C interface to LAPACK is a C-style interface to the LAPACK routines. This interface supports matrices in row-major and column-major order, which you can define in the first function argument matrix_order. Use the mkl_lapacke.h header file with the C interface to LAPACK. The header file specifies constants and prototypes of all the functions.
It also determines whether the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. You can find examples of the C interface to LAPACK in the examples\lapacke subdirectory in the Intel MKL installation directory.

Using Complex Types in C/C++

As described in the documentation for the Intel® Visual Fortran Compiler XE, C/C++ does not directly implement the Fortran types COMPLEX(4) and COMPLEX(8). However, you can write equivalent structures. The type COMPLEX(4) consists of two 4-byte floating-point numbers. The first of them is the real-number component, and the second one is the imaginary-number component. The type COMPLEX(8) is similar to COMPLEX(4) except that it contains two 8-byte floating-point numbers.

Intel MKL provides complex types MKL_Complex8 and MKL_Complex16, which are structures equivalent to the Fortran complex types COMPLEX(4) and COMPLEX(8), respectively. The MKL_Complex8 and MKL_Complex16 types are defined in the mkl_types.h header file. You can use these types to define complex data. You can also redefine the types with your own types before including the mkl_types.h header file. The only requirement is that the types must be compatible with the Fortran complex layout, that is, the complex type must be a pair of real numbers for the values of real and imaginary parts.

For example, you can use the following definitions in your C++ code:

    #define MKL_Complex8 std::complex<float>

and

    #define MKL_Complex16 std::complex<double>

See Example "Calling a Complex BLAS Level 1 Function from C++" for details. You can also define these types in the command line:

    -DMKL_Complex8="std::complex<float>"
    -DMKL_Complex16="std::complex<double>"

See Also
Intel® Software Documentation Library

Calling BLAS Functions that Return the Complex Values in C/C++ Code

Complex values that functions return are handled differently in C and Fortran.
Because BLAS is Fortran-style, you need to be careful when handling a call from C to a BLAS function that returns complex values. However, in addition to normal function calls, Fortran enables calling functions as though they were subroutines, which provides a mechanism for returning the complex value correctly when the function is called from a C program. When a Fortran function is called as a subroutine, the return value is the first parameter in the calling sequence. You can use this feature to call a BLAS function from C.

The following example shows how a call to a Fortran function as a subroutine converts to a call from C and the hidden parameter result gets exposed:

    Normal Fortran function call:            result = cdotc( n, x, 1, y, 1 )
    A call to the function as a subroutine:  call cdotc( result, n, x, 1, y, 1)
    A call to the function from C:           cdotc( &result, &n, x, &one, y, &one )

NOTE
Intel MKL has both upper-case and lower-case entry points in the Fortran-style (case-insensitive) BLAS, with or without the trailing underscore. So, all these names are equivalent and acceptable: cdotc, CDOTC, cdotc_, and CDOTC_.

The above example shows one of the ways to call several level 1 BLAS functions that return complex values from your C and C++ applications. An easier way is to use the CBLAS interface. For instance, you can call the same function using the CBLAS interface as follows:

    cblas_cdotc_sub( n, x, 1, y, 1, &result )

NOTE
The complex value comes last on the argument list in this case.

The following examples show use of the Fortran-style BLAS interface from C and C++, as well as the CBLAS (C language) interface:
• Example "Calling a Complex BLAS Level 1 Function from C"
• Example "Calling a Complex BLAS Level 1 Function from C++"
• Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

Example "Calling a Complex BLAS Level 1 Function from C"

The example below illustrates a call from a C program to the complex BLAS Level 1 function zdotc().
This function computes the dot product of two double-precision complex vectors. In this example, the complex dot product is returned in the structure c.

    #include <stdio.h>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n = N, inca = 1, incb = 1, i;
        MKL_Complex16 a[N], b[N], c;
        for( i = 0; i < n; i++ ){
            a[i].real = (double)i;
            a[i].imag = (double)i * 2.0;
            b[i].real = (double)(n - i);
            b[i].imag = (double)i * 2.0;
        }
        /* The result is returned through the first argument */
        zdotc( &c, &n, a, &inca, b, &incb );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.real, c.imag );
        return 0;
    }

Example "Calling a Complex BLAS Level 1 Function from C++"

Below is the C++ implementation:

    #include <complex>
    #include <iostream>
    #define MKL_Complex16 std::complex<double>
    #include "mkl.h"
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        std::complex<double> a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i] = std::complex<double>(i,i*2.0);
            b[i] = std::complex<double>(n-i,i*2.0);
        }
        zdotc(&c, &n, a, &inca, b, &incb );
        std::cout << "The complex dot product is: " << c << std::endl;
        return 0;
    }

Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"

This example uses CBLAS:

    #include <stdio.h>
    #include "mkl.h"
    typedef struct{ double re; double im; } complex16;
    #define N 5
    int main()
    {
        int n, inca = 1, incb = 1, i;
        complex16 a[N], b[N], c;
        n = N;
        for( i = 0; i < n; i++ ){
            a[i].re = (double)i;
            a[i].im = (double)i * 2.0;
            b[i].re = (double)(n - i);
            b[i].im = (double)i * 2.0;
        }
        /* Arguments are passed by value; the result comes last */
        cblas_zdotc_sub(n, a, inca, b, incb, &c );
        printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im );
        return 0;
    }

Support for Boost uBLAS Matrix-matrix Multiplication

If you are used to uBLAS, you can perform BLAS matrix-matrix multiplication in C++ using Intel MKL substitution of Boost uBLAS functions. uBLAS is the Boost C++ open-source library that provides BLAS functionality for dense, packed, and sparse matrices.
The library uses an expression template technique for passing expressions as function arguments, which enables evaluating vector and matrix expressions in one pass without temporary matrices. uBLAS provides two modes:
• Debug (safe) mode, default. Checks types and conformance.
• Release (fast) mode. Does not check types and conformance. To enable this mode, use the NDEBUG preprocessor symbol.

The documentation for the Boost uBLAS is available at www.boost.org.

Intel MKL provides overloaded prod() functions for substituting uBLAS dense matrix-matrix multiplication with the Intel MKL gemm calls. Though these functions break uBLAS expression templates and introduce temporary matrices, the performance advantage can be considerable for matrix sizes that are not too small (roughly, over 50).

You do not need to change your source code to use the functions. To call them:
• Include the header file mkl_boost_ublas_matrix_prod.hpp in your code (from the Intel MKL include directory)
• Add appropriate Intel MKL libraries to the link line.

The list of expressions that are substituted follows:

    prod( m1, m2 )
    prod( trans(m1), m2 )
    prod( trans(conj(m1)), m2 )
    prod( conj(trans(m1)), m2 )
    prod( m1, trans(m2) )
    prod( trans(m1), trans(m2) )
    prod( trans(conj(m1)), trans(m2) )
    prod( conj(trans(m1)), trans(m2) )
    prod( m1, trans(conj(m2)) )
    prod( trans(m1), trans(conj(m2)) )
    prod( trans(conj(m1)), trans(conj(m2)) )
    prod( conj(trans(m1)), trans(conj(m2)) )
    prod( m1, conj(trans(m2)) )
    prod( trans(m1), conj(trans(m2)) )
    prod( trans(conj(m1)), conj(trans(m2)) )
    prod( conj(trans(m1)), conj(trans(m2)) )

These expressions are substituted in the release mode only (with NDEBUG preprocessor symbol defined). Supported uBLAS versions are Boost 1.34.1 and higher. To get them, visit www.boost.org.
A code example provided in the \examples\ublas\source\sylvester.cpp file illustrates usage of the Intel MKL uBLAS header file for solving a special case of the Sylvester equation.

To run the Intel MKL ublas examples, specify the boost_root parameter in the nmake command, for instance, when using Boost version 1.37.0:

    nmake libia32 boost_root=\boost_1_37_0

Intel MKL ublas examples on default Boost uBLAS configuration support only:
• Microsoft Visual C++* Compiler versions 2005 and higher
• Intel C++ Compiler versions 11.1 and higher with Microsoft Visual Studio IDE versions 2005 and higher

See Also
Using Code Examples

Invoking Intel MKL Functions from Java* Applications

Intel MKL Java* Examples

To demonstrate binding with Java, Intel MKL includes a set of Java examples in the following directory: \examples\java.

The examples are provided for the following MKL functions:
• ?gemm, ?gemv, and ?dot families from CBLAS
• The complete set of non-cluster FFT functions
• ESSL[1]-like functions for one-dimensional convolution and correlation
• VSL Random Number Generators (RNG), except user-defined ones and file subroutines
• VML functions, except GetErrorCallBack, SetErrorCallBack, and ClearErrorCallBack

You can see the example sources in the following directory: \examples\java\examples.

The examples are written in Java. They demonstrate usage of the MKL functions with the following variety of data:
• 1- and 2-dimensional data sequences
• Real and complex types of the data
• Single and double precision

However, the wrappers used in the examples do not:
• Demonstrate the use of large arrays (>2 billion elements)
• Demonstrate processing of arrays in native memory
• Check correctness of function parameters
• Demonstrate performance optimizations

The examples use the Java Native Interface (JNI* developer framework) to bind with Intel MKL. The JNI documentation is available from http://java.sun.com/javase/6/docs/technotes/guides/jni/.
The Java example set includes JNI wrappers that perform the binding. The wrappers do not depend on the examples and may be used in your Java applications. The wrappers for CBLAS, FFT, VML, VSL RNG, and ESSL-like convolution and correlation functions do not depend on each other.

To build the wrappers, just run the examples. The makefile builds the wrapper binaries. After running the makefile, you can run the examples, which will determine whether the wrappers were built correctly. As a result of running the examples, the following directories will be created in \examples\java:
• docs
• include
• classes
• bin
• _results

The directories docs, include, classes, and bin will contain the wrapper binaries and documentation; the directory _results will contain the testing results.

For a Java programmer, the wrappers are the following Java classes:
• com.intel.mkl.CBLAS
• com.intel.mkl.DFTI
• com.intel.mkl.ESSL
• com.intel.mkl.VML
• com.intel.mkl.VSL

Documentation for the particular wrapper and example classes will be generated from the Java sources while building and running the examples. To browse the documentation, open the index file in the docs directory (created by the build script): \examples\java\docs\index.html.

The Java wrappers for CBLAS, VML, VSL RNG, and FFT establish the interface that directly corresponds to the underlying native functions, so you can refer to the Intel MKL Reference Manual for their functionality and parameters. Interfaces for the ESSL-like functions are described in the generated documentation for the com.intel.mkl.ESSL class.

Each wrapper consists of the interface part for Java and JNI stub written in C. You can find the sources in the following directory: \examples\java\wrappers.

Both Java and C parts of the wrapper for CBLAS and VML demonstrate the straightforward approach, which you may use to cover additional CBLAS functions.
The wrapper for FFT is more complicated because it needs to support the lifecycle for FFT descriptor objects. To compute a single Fourier transform, an application needs to call the FFT software several times with the same copy of the native FFT descriptor. The wrapper provides the handler class to hold the native descriptor, while the virtual machine runs Java bytecode.

The wrapper for VSL RNG is similar to the one for FFT. The wrapper provides the handler class to hold the native descriptor of the stream state.

The wrapper for the convolution and correlation functions mitigates the same difficulty of the VSL interface, which assumes a similar lifecycle for "task descriptors". The wrapper utilizes the ESSL-like interface for those functions, which is simpler for the case of 1-dimensional data. The JNI stub additionally encapsulates the MKL functions into the ESSL-like wrappers written in C and so "packs" the lifecycle of a task descriptor into a single call to the native method.

The wrappers meet the JNI Specification versions 1.1 and 5.0 and should work with virtually every modern implementation of Java.

The examples and the Java part of the wrappers are written for the Java language described in "The Java Language Specification (First Edition)" and extended with the feature of "inner classes" (that is, the language as it stood in the late 1990s). This level of language version is supported by all versions of the Sun Java Development Kit* (JDK*) developer toolkit and compatible implementations starting from version 1.1.5, or by all modern versions of Java.

The level of C language is "Standard C" (that is, C89) with additional assumptions about integer and floating-point data types required by the Intel MKL interfaces and the JNI header files. That is, the native float and double data types must be the same as JNI jfloat and jdouble data types, respectively, and the native int must be 4 bytes long.

[1] IBM Engineering Scientific Subroutine Library (ESSL*).
See Also
Running the Java* Examples

Running the Java* Examples

The Java examples support all the C and C++ compilers that Intel MKL does. The makefile intended to run the examples also needs the nmake utility, which is typically provided with the C/C++ compiler package.

To run Java examples, the JDK* developer toolkit is required for compiling and running Java code. A Java implementation must be installed on the computer or available via the network. You may download the JDK from the vendor website.

The examples should work for all versions of JDK. However, they were tested only with the following Java implementations for all the supported architectures:
• J2SE* SDK 1.4.2, JDK 5.0 and 6.0 from Sun Microsystems, Inc. (http://sun.com/).
• JRockit* JDK 1.4.2 and 5.0 from Oracle Corporation (http://oracle.com/).

Note that the Java run-time environment* (JRE*) system, which may be pre-installed on your computer, is not enough. You need the JDK* developer toolkit that supports the following set of tools:
• java
• javac
• javah
• javadoc

To make these tools available for the examples makefile, set the JAVA_HOME environment variable and add the JDK binaries directory to the system PATH, for example:

    SET JAVA_HOME=C:\Program Files\Java\jdk1.5.0_09
    SET PATH=%JAVA_HOME%\bin;%PATH%

You may also need to clear the JDK_HOME environment variable, if it is assigned a value:

    SET JDK_HOME=

To start the examples, use the makefile found in the Intel MKL Java examples directory:

    nmake {dllia32|dllintel64|libia32|libintel64} [function=...] [compiler=...]

If you type the nmake command and omit the target (for example, dllia32), the makefile prints the help info, which explains the targets and parameters.

For the examples list, see the examples.lst file in the Java examples directory.

Known Limitations of the Java* Examples

This section explains limitations of Java examples.
Functionality

Some Intel MKL functions may fail to work if called from the Java environment by using a wrapper, like those provided with the Intel MKL Java examples. Only those specific CBLAS, FFT, VML, VSL RNG, and the convolution/correlation functions listed in the Intel MKL Java Examples section were tested with the Java environment. So, you may use the Java wrappers for these CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions in your Java applications.

Performance

The Intel MKL functions are expected to run faster than similar functions written in pure Java. However, the main goal of these wrappers is to provide code examples, not maximum performance. So, an Intel MKL function called from a Java application will probably run slower than the same function called from a program written in C/C++ or Fortran.

Known bugs

There are a number of known bugs in Intel MKL (identified in the Release Notes), as well as incompatibilities between different versions of JDK. The examples and wrappers include workarounds for these problems. Look at the source code in the examples and wrappers for comments that describe the workarounds.

Coding Tips

This section discusses programming with the Intel® Math Kernel Library (Intel® MKL) to provide coding tips that meet certain, specific needs, such as consistent results of computations or conditional compilation.

Aligning Data for Consistent Results

Routines in Intel MKL may return different results from run-to-run on the same system. This is usually due to a change in the order in which floating-point operations are performed. The two most influential factors are array alignment and parallelism. Array alignment can determine how internal loops order floating-point operations. Non-deterministic parallelism may change the order in which computational tasks are executed. While these results may differ, they should still fall within acceptable computational error bounds.
To better assure identical results from run-to-run, do the following:
• Align input arrays on 16-byte boundaries
• Run Intel MKL in the sequential mode

To align input arrays on 16-byte boundaries, use mkl_malloc() in place of system provided memory allocators, as shown in the code example below. Sequential mode of Intel MKL removes the influence of nondeterministic parallelism.

Aligning Addresses on 16-byte Boundaries

    // ******* C language *******
    ...
    #include "mkl.h"    // declares mkl_malloc and mkl_free
    ...
    void *darray;
    int workspace;
    ...
    // Allocate workspace aligned on 16-byte boundary
    darray = mkl_malloc( sizeof(double)*workspace, 16 );
    ...
    // call the program using MKL
    mkl_app( darray );
    ...
    // Free workspace
    mkl_free( darray );

    ! ******* Fortran language *******
    ...
    double precision darray
    pointer (p_wrk,darray(1))
    integer workspace
    ...
    ! Allocate workspace aligned on 16-byte boundary
    p_wrk = mkl_malloc( 8*workspace, 16 )
    ...
    ! call the program using MKL
    call mkl_app( darray )
    ...
    ! Free workspace
    call mkl_free(p_wrk)

Using Predefined Preprocessor Symbols for Intel® MKL Version-Dependent Compilation

Preprocessor symbols (macros) substitute values in a program before it is compiled. The substitution is performed in the preprocessing phase. The following preprocessor symbols are available:

    Predefined Preprocessor Symbol    Description
    __INTEL_MKL__                     Intel MKL major version
    __INTEL_MKL_MINOR__               Intel MKL minor version
    __INTEL_MKL_UPDATE__              Intel MKL update number
    INTEL_MKL_VERSION                 Intel MKL full version in the following format:
                                      INTEL_MKL_VERSION = (__INTEL_MKL__*100+__INTEL_MKL_MINOR__)*100+__INTEL_MKL_UPDATE__

These symbols enable conditional compilation of code that uses new features introduced in a particular version of the library.

To perform conditional compilation:
1. Include in your code the file where the macros are defined:
   • mkl.h for C/C++
   • mkl.fi for Fortran
2. [Optionally] Use the following preprocessor directives to check whether the macro is defined:
   • #ifdef, #endif for C/C++
   • !DEC$IF DEFINED, !DEC$ENDIF for Fortran
3. Use preprocessor directives for conditional inclusion of code:
   • #if, #endif for C/C++
   • !DEC$IF, !DEC$ENDIF for Fortran

Example

Compile a part of the code if Intel MKL version is MKL 10.3 update 4:

C/C++:

    #include "mkl.h"
    #ifdef INTEL_MKL_VERSION
    #if INTEL_MKL_VERSION == 100304
    // Code to be conditionally compiled
    #endif
    #endif

Fortran:

    include "mkl.fi"
    !DEC$IF DEFINED INTEL_MKL_VERSION
    !DEC$IF INTEL_MKL_VERSION .EQ. 100304
    * Code to be conditionally compiled
    !DEC$ENDIF
    !DEC$ENDIF

Working with the Intel® Math Kernel Library Cluster Software

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

MPI Support

Intel MKL ScaLAPACK and Cluster FFTs support MPI implementations identified in the Intel® Math Kernel Library (Intel® MKL) Release Notes.

To link applications with ScaLAPACK or Cluster FFTs, you need to configure your system depending on your message-passing interface (MPI) implementation as explained below.

If you are using MPICH2, do the following:
1. Add mpich2\include to the include path (assuming the default MPICH2 installation).
2. Add mpich2\lib to the library path.
3. Add mpi.lib to your link command.
4. Add fmpich2.lib to your Fortran link command.
5. Add cxx.lib to your Release target link command and cxxd.lib to your Debug target link command for C++ programs.

If you are using the Microsoft MPI, do the following:
1. Add Microsoft Compute Cluster Pack\include to the include path (assuming the default installation of the Microsoft MPI).
2. Add Microsoft Compute Cluster Pack\Lib\AMD64 to the library path.
3. Add msmpi.lib to your link command.

If you are using the Intel® MPI, do the following:
1. Add the following string to the include path: %ProgramFiles%\Intel\MPI\\\include, where is the directory for a particular MPI version and is ia32 or intel64, for example, %ProgramFiles%\Intel\MPI\3.1\intel64\include.
2. Add the following string to the library path: %ProgramFiles%\Intel\MPI\\\lib, for example, %ProgramFiles%\Intel\MPI\3.1\intel64\lib.
3. Add impi.lib and impicxx.lib to your link command.

Check the documentation that comes with your MPI implementation for implementation-specific details of linking.

Linking with ScaLAPACK and Cluster FFTs

To link with Intel MKL ScaLAPACK and/or Cluster FFTs, use the following commands:

    set lib =;;%lib%

where the placeholders stand for paths and libraries as explained in the following table:

\lib\{ia32|intel64}, depending on your architecture. If you performed the Setting Environment Variables step of the Getting Started process, you do not need to add this directory to the lib environment variable.

Typically the lib subdirectory in the MPI installation directory. For example, C:\Program Files (x86)\Intel\MPI\3.2.0.005\ia32\lib for a default installation of Intel MPI 3.2.

One of icl, ifort, xilink.

One of ScaLAPACK or Cluster FFT libraries for the appropriate architecture, which are listed in Directory Structure in Detail.
For example, for the IA-32 architecture, it is one of mkl_scalapack_core.lib or mkl_cdft_core.lib.

The BLACS library corresponding to your architecture, programming interface (LP64 or ILP64), and MPI version. These libraries are listed in Directory Structure in Detail. For example, for the IA-32 architecture, choose one of mkl_blacs_mpich2.lib or mkl_blacs_intelmpi.lib in case of static linking or mkl_blacs_dll.lib in case of dynamic linking; specifically, for MPICH2, choose mkl_blacs_mpich2.lib in case of static linking.

Intel MKL libraries other than ScaLAPACK or Cluster FFTs libraries.

TIP
Use the Link-line Advisor to quickly choose the appropriate set of , , and .

Intel MPI provides prepackaged scripts for its linkers to help you link using the respective linker. Therefore, if you are using Intel MPI, the best way to link is to use the following commands:

    \mpivars.bat
    set lib = ;%lib%

where the placeholders that are not yet defined are explained in the following table:

By default, the bin subdirectory in the MPI installation directory. For example, C:\Program Files (x86)\Intel\MPI\3.2.0.005\ia32\lib for a default installation of Intel MPI 3.2;

mpicl or mpiifort

See Also
Linking Your Application with the Intel® Math Kernel Library
Examples for Linking with ScaLAPACK and Cluster FFT

Determining the Number of Threads

The OpenMP* software responds to the environment variable OMP_NUM_THREADS. Intel MKL also has other mechanisms to set the number of threads, such as the MKL_NUM_THREADS or MKL_DOMAIN_NUM_THREADS environment variables (see Using Additional Threading Control).

Make sure that the relevant environment variables have the same and correct values on all the nodes. Intel MKL versions 10.0 and higher no longer set the default number of threads to one, but depend on the OpenMP libraries used with the compiler to set the default number.
For the threading layer based on the Intel compiler (mkl_intel_thread.lib), this value is the number of CPUs according to the OS.

CAUTION Avoid over-prescribing the number of threads, which may occur, for instance, when the number of MPI ranks per node and the number of threads per node are both greater than one. The product of MPI ranks per node and the number of threads per node should not exceed the number of physical cores per node.

The OMP_NUM_THREADS environment variable is assumed in the discussion below.

Set OMP_NUM_THREADS so that the product of its value and the number of MPI ranks per node equals the number of real processors or cores of a node. If the Intel® Hyper-Threading Technology is enabled on the node, use only half the number of processors that are visible on Windows OS.

See Also
Setting Environment Variables on a Cluster

Using DLLs

All the needed DLLs must be visible on all the nodes at run time, and you should install Intel® Math Kernel Library (Intel® MKL) on each node of the cluster. You can use Remote Installation Services (RIS) provided by Microsoft to remotely install the library on each of the nodes that are part of your cluster. The best way to make the DLLs visible is to point to these libraries in the PATH environment variable. See Setting Environment Variables on a Cluster on how to set the value of the PATH environment variable.

The ScaLAPACK DLLs for the IA-32 and Intel® 64 architectures (in the \redist\ia32\mkl and \redist\intel64\mkl directories, respectively) use the MPI dispatching mechanism. MPI dispatching is based on the MKL_BLACS_MPI environment variable. The BLACS DLL uses MKL_BLACS_MPI for choosing the needed MPI libraries. The table below lists possible values of the variable.

Value      Comment
MPICH2     Default value. MPICH2 1.0.x for Windows* OS is used for message passing
INTELMPI   Intel MPI is used for message passing
MSMPI      Microsoft MPI is used for message passing

If you are using a non-default MPI, assign the same appropriate value to MKL_BLACS_MPI on all nodes.

See Also
Setting Environment Variables on a Cluster

Setting Environment Variables on a Cluster

If you are using MPICH2 or Intel MPI, to set an environment variable on the cluster, use the -env, -genv, and -genvlist keys of mpiexec.

See the following MPICH2 examples on how to set the value of OMP_NUM_THREADS:

mpiexec -genv OMP_NUM_THREADS 2 ....
mpiexec -genvlist OMP_NUM_THREADS ....
mpiexec -n 1 -host first -env OMP_NUM_THREADS 2 test.exe : -n 1 -host second -env OMP_NUM_THREADS 3 test.exe ....

See the following Intel MPI examples on how to set the value of MKL_BLACS_MPI:

mpiexec -genv MKL_BLACS_MPI INTELMPI ....
mpiexec -genvlist MKL_BLACS_MPI ....
mpiexec -n 1 -host first -env MKL_BLACS_MPI INTELMPI test.exe : -n 1 -host second -env MKL_BLACS_MPI INTELMPI test.exe

When using MPICH2, you may have problems with getting the global environment, such as MKL_BLACS_MPI, by the -genvlist key. In this case, set up user or system environments on each node as follows: From the Start menu, select Settings > Control Panel > System > Advanced > Environment Variables.

If you are using Microsoft MPI, the above ways of setting environment variables are also applicable if the Microsoft Single Program Multiple Data (SPMD) process managers are running in a debug mode on all nodes of the cluster. However, the best way to set environment variables is using the Job Scheduler with the Microsoft Management Console (MMC) and/or the Command Line Interface (CLI) to submit a job and pass environment variables. For more information about MMC and CLI, see the Microsoft Help and Support page at the Microsoft Web site (http://www.microsoft.com/).
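The thread-count rule from Determining the Number of Threads and the -genv key of mpiexec can be combined in a small helper. The following is only a sketch under stated assumptions: the function names threads_per_rank and mpiexec_command are hypothetical and are not part of Intel MKL or any MPI distribution, and the sketch assumes that with Intel® Hyper-Threading Technology enabled the OS reports twice as many processors as there are physical cores.

```python
def threads_per_rank(visible_processors, ranks_per_node, hyper_threading=False):
    """Suggest an OMP_NUM_THREADS value for one node (hypothetical helper).

    The product of MPI ranks per node and threads per rank should not
    exceed the number of physical cores of the node.
    """
    if visible_processors < 1 or ranks_per_node < 1:
        raise ValueError("processor and rank counts must be positive")
    # Assumption: with Hyper-Threading enabled, only half of the processors
    # visible to the OS are physical cores.
    cores = visible_processors // 2 if hyper_threading else visible_processors
    cores = max(cores, 1)
    # Never return less than one thread; if ranks_per_node already exceeds
    # the core count, the node is oversubscribed regardless of this value.
    return max(cores // ranks_per_node, 1)


def mpiexec_command(executable, total_ranks, omp_threads):
    """Build an argument list that passes OMP_NUM_THREADS with -genv."""
    return ["mpiexec", "-genv", "OMP_NUM_THREADS", str(omp_threads),
            "-n", str(total_ranks), executable]


if __name__ == "__main__":
    # Example: 16 visible processors, Hyper-Threading on, 2 MPI ranks per node.
    threads = threads_per_rank(16, 2, hyper_threading=True)
    print(" ".join(mpiexec_command("test.exe", 4, threads)))
```

The helper only assembles the command line; whether -genv propagates the variable to every node still depends on the MPI implementation, as noted above for MPICH2.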
Building ScaLAPACK Tests

To build ScaLAPACK tests,
• For the IA-32 architecture, add mkl_scalapack_core.lib to your link command.
• For the Intel® 64 architecture, add mkl_scalapack_lp64.lib or mkl_scalapack_ilp64.lib, depending on the desired interface.

Examples for Linking with ScaLAPACK and Cluster FFT

This section provides examples of linking with ScaLAPACK and Cluster FFT.

Note that a binary linked with ScaLAPACK runs the same way as any other MPI application (refer to the documentation that comes with your MPI implementation).

For further linking examples, see the support website for Intel products at http://www.intel.com/software/products/support/.

See Also
Directory Structure in Detail

Examples for Linking a C Application

These examples illustrate linking of an application whose main module is in C under the following conditions:
• MPICH2 1.0.x is installed in c:\mpich2x64.
• You use the Intel® C++ Compiler 10.0 or higher.
To link with ScaLAPACK using LP64 interface for a cluster of Intel® 64 architecture based systems, set the environment variable and use the link line as follows:

set lib=c:\mpich2x64\lib;\lib\intel64;%lib%
icl mkl_scalapack_lp64.lib mkl_blacs_mpich2_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mpi.lib cxx.lib bufferoverflowu.lib

To link with Cluster FFT using LP64 interface for a cluster of Intel® 64 architecture based systems, set the environment variable and use the link line as follows:

set lib=c:\mpich2x64\lib;\lib\intel64;%lib%
icl mkl_cdft_core.lib mkl_blacs_mpich2_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mpi.lib cxx.lib bufferoverflowu.lib

See Also
Linking with ScaLAPACK and Cluster FFTs
Linking with System Libraries

Examples for Linking a Fortran Application

These examples illustrate linking of an application whose main module is in Fortran under the following conditions:
• Microsoft Windows Compute Cluster Pack SDK is installed in c:\MS CCP SDK.
• You use the Intel® Fortran Compiler 10.0 or higher.
To link with ScaLAPACK using LP64 interface for a cluster of Intel® 64 architecture based systems, set the environment variable and use the link line as follows:

set lib="c:\MS CCP SDK\Lib\AMD64";\lib\intel64;%lib%
ifort mkl_scalapack_lp64.lib mkl_blacs_mpich2_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib msmpi.lib bufferoverflowu.lib

To link with Cluster FFTs using LP64 interface for a cluster of Intel® 64 architecture based systems, set the environment variable and use the link line as follows:

set lib="c:\MS CCP SDK\Lib\AMD64";\lib\intel64;%lib%
ifort mkl_cdft_core.lib mkl_blacs_mpich2_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib msmpi.lib bufferoverflowu.lib

See Also
Linking with ScaLAPACK and Cluster FFTs
Linking with System Libraries

Programming with Intel® Math Kernel Library in Integrated Development Environments (IDE) 9

Configuring Your Integrated Development Environment to Link with Intel Math Kernel Library

Configuring the Microsoft Visual C/C++* Development System to Link with Intel® MKL

Steps for configuring Microsoft Visual C/C++* Development System for linking with Intel® Math Kernel Library (Intel® MKL) depend on whether you installed the C++ Integration(s) in Microsoft Visual Studio* component of the Intel® Composer XE:
• If you installed the integration component, see Automatically Linking Your Microsoft Visual C/C++ Project with Intel MKL.
• If you did not install the integration component or need more control over Intel MKL libraries to link, you can configure the Microsoft Visual C++* 2005, Visual C++* 2008, or Visual C++* 2010 development system by performing the following steps.
Though some versions of the Visual C++* development system may vary slightly in the menu items mentioned below, the fundamental configuring steps are applicable to all these versions.
1. From the menu, select View > Solution Explorer (and make sure this window is active)
2. Select Tools > Options > Projects > VC++ Directories
3. From the Show directories for list, select Include Files. Add the directory for the Intel MKL include files, that is, \include
4. From the Show directories for list, select Library Files. Add architecture-specific directories for Intel MKL and OpenMP* libraries, for example: \lib\ia32 and \compiler\lib\ia32
5. From the Show directories for list, select Executable Files. Add architecture-specific directories with dynamic-link libraries:
   • For OpenMP* support, for example: \redist\ia32\compiler
   • For Intel MKL (only if you link dynamically), for example: \redist\ia32\mkl
6. Select Project>Properties>Configuration Properties>Linker>Input>Additional Dependencies. Add the libraries required, for example, mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

See Also
Intel® Software Documentation Library
Linking in Detail

Configuring Intel® Visual Fortran to Link with Intel MKL

Steps for configuring Intel® Visual Fortran for linking with Intel® Math Kernel Library (Intel® MKL) depend on whether you installed the Visual Fortran Integration(s) in Microsoft Visual Studio* component of the Intel® Composer XE:
• If you installed the integration component, see Automatically Linking Your Intel® Visual Fortran Project with Intel® MKL.
• If you did not install the integration component or need more control over Intel MKL libraries to link, you can configure your project as follows:
1. Select Project>Properties>Linker>General>Additional Library Directories. Add architecture-specific directories for Intel MKL and OpenMP* libraries, for example: \lib\ia32 and \compiler\lib\ia32
2. Select Project>Properties>Linker>Input>Additional Dependencies.
Insert names of the required libraries, for example: mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
3. Select Project>Properties>Debugging>Environment. Add architecture-specific paths to dynamic-link libraries:
   • For OpenMP* support; for example: enter PATH=%PATH%;\redist\ia32\compiler
   • For Intel MKL (only if you link dynamically); for example: enter PATH=%PATH%;\redist\ia32\mkl

See Also
Intel® Software Documentation Library

Running an Intel MKL Example in the Visual Studio* 2008 IDE

This section explains how to create and configure projects with the Intel® Math Kernel Library (Intel® MKL) examples in Microsoft Visual Studio* 2008. For Intel MKL examples where the instructions below do not work, see Known Limitations.

To run the Intel MKL C examples in Microsoft Visual Studio 2008:
1. Do either of the following:
   • Install Intel® C/C++ Compiler and integrate it into Visual Studio (recommended).
   • Use the Microsoft Visual C++* 2008 Compiler integrated into Visual Studio*.
2. Create, configure, and run the Intel C/C++ and/or Microsoft Visual C++* 2008 project.

To run the Intel MKL Fortran examples in Microsoft Visual Studio 2008:
1. Install Intel® Visual Fortran Compiler and integrate it into Visual Studio. The default installation of the Intel Visual Fortran Compiler performs this integration. For more information, see the Intel Visual Fortran Compiler documentation.
2. Create, configure, and run the Intel Visual Fortran project.

Creating, Configuring, and Running the Intel® C/C++ and/or Visual C++* 2008 Project

This section demonstrates how to create a Visual C/C++ project using an Intel® Math Kernel Library (Intel® MKL) example in Microsoft Visual Studio 2008.

The instructions below create a Win32/Debug project running one Intel MKL example in a Console window. For details on creation of different kinds of Microsoft Visual Studio projects, refer to MSDN Visual Studio documentation at http://www.microsoft.com.
To create and configure the Win32/Debug project running an Intel MKL C example with the Intel® C/C++ Compiler integrated into Visual Studio and/or Microsoft Visual C++* 2008, perform the following steps:
1. Create a C Project:
   a. Open Visual Studio 2008.
   b. On the main menu, select File > New > Project to open the New Project window.
   c. Select Project Types > Visual C++ > Win32, then select Templates > Win32 Console Application. In the Name field, type , for example, MKL_CBLAS_CAXPYIX, and click OK.
      The New Project window closes, and the Win32 Application Wizard - window opens.
   d. Select Next, then select Application Settings, check Additional options > Empty project, and click Finish.
      The Win32 Application Wizard - window closes.
The next steps are performed inside the Solution Explorer window. To open it, select View > Solution Explorer from the main menu.
2. (optional) To switch to the Intel C/C++ project, right-click and from the drop-down menu, select Convert to use Intel® C++ Project System. (The menu item is available if the Intel® C/C++ Compiler is integrated into Visual Studio.)
3. Add sources of the Intel MKL example to the project:
   a. Right-click the Source Files folder under and select Add > Existing Item... from the drop-down menu.
      The Add Existing Item - window opens.
   b. Browse to the Intel MKL example directory, for example, \examples\cblas\source. Select the example file and supporting files with extension ".c" (C sources), for example, select files cblas_caxpyix.c and common_func.c. For the list of supporting files in each example directory, see Support Files for Intel MKL Examples. Click Add.
      The Add Existing Item - window closes, and selected files appear in the Source Files folder in Solution Explorer.
The next steps adjust the properties of the project.
4. Select .
5. On the main menu, select Project > Properties to open the Property Pages window.
6.
Set Intel MKL Include dependencies:
   a. Select Configuration Properties > C/C++ > General. In the right-hand part of the window, select Additional Include Directories > ... (the browse button).
      The Additional Include Directories window opens.
   b. Click the New Line button (the first button in the uppermost row). When the new line appears in the window, click the browse button.
      The Select Directory window opens.
   c. Browse to the \include directory and click OK.
      The Select Directory window closes, and full path to the Intel MKL include directory appears in the Additional Include Directories window.
   d. Click OK to close the window.
7. Set library dependencies:
   a. Select Configuration Properties > Linker > General. In the right-hand part of the window, select Additional Library Directories > ... (the browse button).
      The Additional Library Directories window opens.
   b. Click the New Line button (the first button in the uppermost row). When the new line appears in the window, click the browse button.
      The Select Directory window opens.
   c. Browse to the directory with the Intel MKL libraries \lib\, where is one of {ia32, intel64}, for example: \lib\ia32. (For most laptop and desktop computers, is ia32.) Click OK.
      The Select Directory window closes, and the full path to the Intel MKL libraries appears in the Additional Library Directories window.
   d. Click the New Line button again. When the new line appears in the window, click the browse button.
      The Select Directory window opens.
   e. Browse to the compiler\lib\, where is one of { ia32, intel64 }, for example: \compiler\lib\ia32. Click OK.
      The Select Directory window closes, and the specified full path appears in the Additional Library Directories window.
   f. Click OK to close the Additional Library Directories window.
   g. Select Configuration Properties > Linker > Input. In the right-hand part of the window, select Additional Dependencies > ... (the browse button).
      The Additional Dependencies window opens.
   h.
Type the libraries required, for example, if =ia32, type mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
      For more details, see Linking in Detail.
   i. Click OK to close the Additional Dependencies window.
   j. If the Intel MKL example directory does not contain a data directory, skip the next step.
8. Set data dependencies for the Intel MKL example:
   a. Select Configuration Properties > Debugging. In the right-hand part of the window, select Command Arguments > > .
      The Command Arguments window opens.
   b. Type the path to the proper data file in quotes. The name of the data file is the same as the name of the example file, with a "d" extension, for example, "\examples\cblas\data\cblas_caxpyix.d".
   c. Click OK to close the Command Arguments window.
9. Click OK to close the Property Pages window.
10. Certain examples do not pause before the end of execution. To see the results printed in the Console window, set a breakpoint at the very last 'return 0;' statement or add a call to 'getchar();' before the last 'return 0' statement.
11. To build the solution, select Build > Build Solution.
    NOTE You may see warnings about unsafe functions and variables. To get rid of these warnings, go to Project > Properties, and when the Property Pages window opens, go to Configuration Properties > C/C++ > Preprocessor. In the right-hand part of the window, select Preprocessor Definitions, add _CRT_SECURE_NO_WARNINGS, and click OK.
12. To run the example, select Debug > Start Debugging.
    The Console window opens.
13. You can see the results of the example in the Console window. If you used the 'getchar();' statement to pause execution of the program, press Enter to complete the run. If you used a breakpoint to pause execution of the program, select Debug > Continue.
    The Console window closes.
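The data-file naming rule in step 8 (same name as the example file, with a "d" extension, in the data directory next to the source directory) can be written down as a short helper. This is a sketch only: the function names are hypothetical, and the C:\mkl prefix in the usage example is an assumed installation path used for illustration, not a default of the product.

```python
from pathlib import PureWindowsPath


def example_data_file(source_file):
    """Map <example dir>\\source\\<name>.c to <example dir>\\data\\<name>.d."""
    src = PureWindowsPath(source_file)
    # The data directory is assumed to be a sibling of the source directory.
    data_dir = src.parent.parent / "data"
    return data_dir / (src.stem + ".d")


def command_argument(source_file):
    """Return the quoted path to type into the Command Arguments field."""
    return '"{}"'.format(example_data_file(source_file))


if __name__ == "__main__":
    # C:\mkl is a hypothetical installation directory.
    print(command_argument(r"C:\mkl\examples\cblas\source\cblas_caxpyix.c"))
```

Examples without a data directory take no command argument, so the helper applies only when step 8 is not skipped.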
See Also
Running an Intel MKL Example in the Visual Studio* 2008 IDE

Creating, Configuring, and Running the Intel Visual Fortran Project

This section demonstrates how to create an Intel Visual Fortran project running an Intel MKL example in Microsoft Visual Studio 2008.

The instructions below create a Win32/Debug project running one Intel MKL example in a Console window. For details on creation of different kinds of Microsoft Visual Studio projects, refer to MSDN Visual Studio documentation at http://www.microsoft.com.

To create and configure a Win32/Debug project running the Intel MKL Fortran example with the Intel Visual Fortran Compiler integrated into Visual Studio, perform the following steps:
1. Create a Visual Fortran Project:
   a. Open Visual Studio 2008.
   b. On the main menu, select File > New > Project to open the New Project window.
   c. Select Project Types > Intel® Fortran > Console Application, then select Templates > Empty Project. When done, in the Name field, type for example, MKL_PDETTF_D_TRIG_TRANSFORM_BVP, and click OK.
      The New Project window closes.
The next steps are performed inside the Solution Explorer window. To open it, select View>Solution Explorer from the main menu.
2. Add sources of Intel MKL example to the project:
   a. Right-click the Source Files folder under and select Add > Existing Item... from the drop-down menu.
      The Add Existing Item - window opens.
   b. Browse to the Intel MKL example directory, for example, \examples\pdettf\source. Select the example file and supporting files with extension ".f" or ".f90" (Fortran sources). For example, select the d_trig_tforms_bvp.f90 file. For the list of supporting files in each example directory, see Support Files for Intel MKL Examples. Click Add.
      The Add Existing Item - window closes, and the selected files appear in the Source Files folder in Solution Explorer.
      Some examples with the "use" statements require the next two steps.
   c.
Right-click the Header Files folder under and select Add > Existing Item... from the drop-down menu.
      The Add Existing Item - window opens.
   d. Browse to the \include directory. Select the header files that appear in the "use" statements. For example, select the mkl_dfti.f90 and mkl_trig_transforms.f90 files. Click Add.
      The Add Existing Item - window closes, and the selected files appear in the Header Files folder in Solution Explorer.
The next steps adjust the properties of the project:
3. Select the .
4. On the main menu, select Project > Properties to open the Property Pages window.
5. Set the Intel MKL include dependencies:
   a. Select Configuration Properties > Fortran > General. In the right-hand part of the window, select Additional Include Directories > > .
      The Additional Include Directories window opens.
   b. Type the Intel MKL include directory in quotes: "\include". Click OK to close the window.
6. Select Configuration Properties > Fortran > Preprocessor. In the right-hand part of the window, select Preprocess Source File > Yes (default is No). This step is recommended because some examples require preprocessing.
7. Set library dependencies:
   a. Select Configuration Properties > Linker > General. In the right-hand part of the window, select Additional Library Directories > > .
      The Additional Library Directories window opens.
   b. Type the directory with the Intel MKL libraries in quotes, that is, "\lib\", where is one of { ia32, intel64 }, for example: "\lib\ia32". (For most laptop and desktop computers is ia32.) Click OK to close the window.
   c. Select Configuration Properties > Linker > Input. In the right-hand part of the window, select Additional Dependencies and type the libraries required, for example, if =ia32, type mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib.
8. In the Property Pages window, click OK to close the window.
9. Some examples do not pause before the end of execution.
To see the results printed in the Console window, set a breakpoint at the very end of the program or add the 'pause' statement before the last 'end' statement.
10. To build the solution, select Build > Build Solution.
11. To run the example, select Debug > Start Debugging.
    The Console window opens.
12. You can see the results of the example in the Console window. If you used 'pause' statement to pause execution of the program, press Enter to complete the run. If you used a breakpoint to pause execution of the program, select Debug > Continue.
    The Console window closes.

Support Files for Intel® Math Kernel Library Examples

Below is the list of support files that have to be added to the project for respective examples:

examples\cblas\source:   common_func.c
examples\dftc\source:    dfti_example_status_print.c
                         dfti_example_support.c

Known Limitations of the Project Creation Procedure

You cannot create a Visual Studio* project using the instructions from Creating, Configuring, and Running the Intel® C/C++ and/or Visual C++* 2008 Project or Creating, Configuring, and Running the Intel® Visual Fortran Project for examples from the following directories:

examples\blas
examples\blas95
examples\cdftc
examples\cdftf
examples\dftf
examples\fftw2x_cdf
examples\fftw2xc
examples\fftw2xf
examples\fftw3xc
examples\fftw3xf
examples\java
examples\lapack
examples\lapack95

Getting Assistance for Programming in the Microsoft Visual Studio* IDE

Viewing Intel MKL Documentation in Visual Studio* IDE

Viewing Intel MKL Documentation in Document Explorer (Visual Studio* 2005/2008 IDE)

Intel MKL documentation is integrated in the Visual Studio IDE (VS) help collection. To open Intel MKL help,
1. Select Help > Contents from the menu. This displays the list of VS Help collections.
2. Click Intel Math Kernel Library Help.
3. In the help tree that expands, click Intel MKL Reference Manual.
To open the help index, select Help > Index from the menu.

To search in the help, select Help > Search from the menu and enter a search string.

You can filter Visual Studio Help collections to show only content related to installed Intel tools. To do this, select "Intel" from the Filtered by list. This hides the contents and index entries for all collections that do not refer to Intel.

Accessing Intel MKL Documentation in Visual Studio* 2010 IDE

To access the Intel MKL documentation in Visual Studio* 2010 IDE:
• Configure the IDE to use local help (once). To do this, go to Help > Manage Help Settings and check I want to use local help
• Use the Help > View Help menu item to view a list of available help collections and open the Intel MKL documentation.

Using Context-Sensitive Help

When typing your code in the Visual Studio* (VS) IDE Code Editor, you can get context-sensitive help using the F1 Help and Dynamic Help features.

F1 Help

To open the help topic relevant to the current selection, press F1.

In particular, to open the help topic describing an Intel MKL function called in your code, select the function name and press F1. The topic with the function description opens in the window that displays search results:

Dynamic Help

Dynamic Help also provides access to topics relevant to the current selection or to the text being typed. Links to all relevant topics are displayed in the Dynamic Help window.

To get the list of relevant topics each time you select the Intel MKL function name or as you type it in your code, open the Dynamic Help window by selecting Help > Dynamic Help from the menu.

To open a topic from the list, click the appropriate link in the Dynamic Help window, shown in the above figure. Typically only one link corresponds to each Intel MKL function.
Using the IntelliSense* Capability

IntelliSense is a set of native Visual Studio* (VS) IDE features that make language references easily accessible.

The user programming with Intel MKL in the VS Code Editor can employ two IntelliSense features: Parameter Info and Complete Word.

Both features use header files. Therefore, to benefit from IntelliSense, make sure the path to the include files is specified in the VS or solution settings. For example, see Configuring the Microsoft Visual C/C++* Development System to Link with Intel® MKL on how to do this.

Parameter Info

The Parameter Info feature displays the parameter list for a function to give information on the number and types of parameters. This feature requires adding the include statement with the appropriate Intel MKL header file to your code.

To get the list of parameters of a function specified in the header file,
1. Type the function name.
2. Type the opening parenthesis.
This brings up the tooltip with the list of the function parameters:

Complete Word

For a software library, the Complete Word feature types or prompts for the rest of the name defined in the header file once you type the first few characters of the name in your code. This feature requires adding the include statement with the appropriate Intel MKL header file to your code.

To complete the name of the function or named constant specified in the header file,
1. Type the first few characters of the name.
2. Press Alt+RIGHT ARROW or Ctrl+SPACEBAR.
   If you have typed enough characters to disambiguate the name, the rest of the name is typed automatically. Otherwise, a pop-up list appears with the names specified in the header file.
3. Select the name from the list, if needed.
LINPACK and MP LINPACK Benchmarks 10

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Intel® Optimized LINPACK Benchmark for Windows* OS

Intel® Optimized LINPACK Benchmark is a generalization of the LINPACK 1000 benchmark. It solves a dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy. The generalization is in the number of equations (N) it can solve, which is not limited to 1000. It uses partial pivoting to assure the accuracy of the results.

Do not use this benchmark to report LINPACK 100 performance because that is a compiled-code only benchmark. This is a shared-memory (SMP) implementation which runs on a single platform. Do not confuse this benchmark with:
• MP LINPACK, which is a distributed memory version of the same benchmark.
• LINPACK, the library, which has been expanded upon by the LAPACK library.
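The measurement described above (solve Ax=b with partial pivoting, time the factor-and-solve step, convert the time into a rate, check the result) can be illustrated with a toy pure-Python sketch. It is not the Intel benchmark and says nothing about its internals: the function names are hypothetical, the operation count 2/3*N^3 + 2*N^2 is the conventional LINPACK count rather than something stated in this guide, and the accuracy check here is a plain maximum residual, which is an assumption of this sketch.

```python
import random
import time


def solve_with_partial_pivoting(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    Works in place on the list-of-lists matrix a and the list b.
    """
    n = len(a)
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
            b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            if m != 0.0:
                row_i, row_k = a[i], a[k]
                for j in range(k + 1, n):
                    row_i[j] -= m * row_k[j]
                b[i] -= m * b[k]
    # Back substitution on the upper triangle.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def run_benchmark(n, seed=0):
    """Time one random N x N solve and report a rate and a residual."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    a_work = [row[:] for row in a]  # keep the originals for the residual
    b_work = b[:]
    start = time.perf_counter()
    x = solve_with_partial_pivoting(a_work, b_work)
    seconds = time.perf_counter() - start
    # Conventional LINPACK operation count (an assumption of this sketch).
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    rate = flops / seconds if seconds > 0 else float("inf")
    residual = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
                   for i in range(n))
    return {"n": n, "seconds": seconds, "flops_per_second": rate,
            "max_residual": residual}


if __name__ == "__main__":
    print(run_benchmark(100))
```

The rates this sketch prints are those of interpreted Python and are not comparable with results from the benchmark binaries; it only shows which quantities are measured and reported.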
Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your genuine Intel processor systems more easily than with the High Performance Linpack (HPL) benchmark. Use this package to benchmark your SMP machine.

Additional information on this software as well as other Intel software performance products is available at http://www.intel.com/software/products/.

Contents of the Intel® Optimized LINPACK Benchmark

The Intel Optimized LINPACK Benchmark for Windows* OS contains the following files, located in the benchmarks\linpack\ subdirectory of the Intel® Math Kernel Library (Intel® MKL) directory:

File in benchmarks\linpack\   Description
linpack_xeon32.exe            The 32-bit program executable for a system based on Intel® Xeon® processor or Intel® Xeon® processor MP with or without Streaming SIMD Extensions 3 (SSE3).
linpack_xeon64.exe            The 64-bit program executable for a system with Intel® Xeon® processor using Intel® 64 architecture.
runme_xeon32.bat              A sample shell script for executing a pre-determined problem set for linpack_xeon32.exe. OMP_NUM_THREADS set to 2 processors.
runme_xeon64.bat              A sample shell script for executing a pre-determined problem set for linpack_xeon64.exe. OMP_NUM_THREADS set to 4 processors.
lininput_xeon32               Input file for pre-determined problem for the runme_xeon32 script.
lininput_xeon64               Input file for pre-determined problem for the runme_xeon64 script.
win_xeon32.txt                Result of the runme_xeon32 script execution.
win_xeon64.txt                Result of the runme_xeon64 script execution.
help.lpk                      Simple help file.
xhelp.lpk                     Extended help file.

See Also
High-level Directory Structure

Running the Software

To obtain results for the pre-determined sample problem sizes on a given system, type one of the following, as appropriate:

runme_xeon32.bat
runme_xeon64.bat

To run the software for other problem sizes, see the extended help included with the program.
Extended help can be viewed by running the program executable with the -e option:

linpack_xeon32.exe -e
linpack_xeon64.exe -e

The pre-defined data input files lininput_xeon32 and lininput_xeon64 are provided merely as examples. Different systems have different numbers of processors or amounts of memory and thus require new input files. The extended help can be used for insight into proper ways to change the sample input files.

Each input file requires at least the following amount of memory:

lininput_xeon32   2 GB
lininput_xeon64   16 GB

If the system has less memory than the above sample data input requires, you may need to edit or create your own data input files, as explained in the extended help.

Each sample script uses the OMP_NUM_THREADS environment variable to set the number of processors it is targeting. To optimize performance on a different number of physical processors, change that line appropriately. If you run the Intel Optimized LINPACK Benchmark without setting the number of threads, it will default to the number of cores according to the OS. You can find the settings for this environment variable in the runme_* sample scripts. If the settings do not yet match the situation for your machine, edit the script.

Known Limitations of the Intel® Optimized LINPACK Benchmark

The following limitations are known for the Intel Optimized LINPACK Benchmark for Windows* OS:
• Intel Optimized LINPACK Benchmark is threaded to effectively use multiple processors. So, in multiprocessor systems, best performance will be obtained with the Intel® Hyper-Threading Technology turned off, which ensures that the operating system assigns threads to physical processors only.
• If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.

Intel® Optimized MP LINPACK Benchmark for Clusters

Overview of the Intel® Optimized MP LINPACK Benchmark for Clusters

The Intel® Optimized MP LINPACK Benchmark for Clusters is based on modifications and additions to HPL 2.0 from Innovative Computing Laboratories (ICL) at the University of Tennessee, Knoxville (UTK). The Intel Optimized MP LINPACK Benchmark for Clusters can be used for Top 500 runs (see http://www.top500.org). To use the benchmark you need to be intimately familiar with the HPL distribution and usage. The Intel Optimized MP LINPACK Benchmark for Clusters provides some additional enhancements and bug fixes designed to make the HPL usage more convenient, and it explains Intel® Message-Passing Interface (MPI) settings that may enhance performance. The .\benchmarks\mp_linpack directory adds techniques to minimize search times frequently associated with long runs.

The Intel® Optimized MP LINPACK Benchmark for Clusters is an implementation of the Massively Parallel MP LINPACK benchmark by means of HPL code. It solves a random dense (real*8) system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system, converts that time into a performance rate, and tests the results for accuracy.
You can solve any size (N) system of equations that fits into memory. The benchmark uses full row pivoting to ensure the accuracy of the results.

Use the Intel Optimized MP LINPACK Benchmark for Clusters on a distributed memory machine. On a shared memory machine, use the Intel Optimized LINPACK Benchmark.

Intel provides optimized versions of the LINPACK benchmarks to help you obtain high LINPACK benchmark results on your systems based on genuine Intel processors more easily than with the HPL benchmark. Use the Intel Optimized MP LINPACK Benchmark to benchmark your cluster. The prebuilt binaries require that Intel® MPI 3.x be installed on the cluster first. The run-time version of Intel MPI is free and can be downloaded from www.intel.com/software/products/ .

The Intel package includes software developed at the University of Tennessee, Knoxville, Innovative Computing Laboratories, and neither the University nor ICL endorses or promotes this product. Although HPL 2.0 is redistributable under certain conditions, this particular package is subject to the Intel MKL license.

Intel MKL has introduced a new functionality into MP LINPACK, called a hybrid build, while continuing to support the older version. The term hybrid refers to special optimizations added to take advantage of mixed OpenMP*/MPI parallelism. If you want to use one MPI process per node and to achieve further parallelism by means of OpenMP, use the hybrid build. In general, the hybrid build is useful when the number of MPI processes per core is less than one. If you want to rely exclusively on MPI for parallelism and use one MPI process per core, use the non-hybrid build.

In addition to supplying certain hybrid prebuilt binaries, Intel MKL supplies some hybrid prebuilt libraries for Intel® MPI to take advantage of the additional OpenMP* optimizations.

If you wish to use an MPI version other than Intel MPI, you can do so by using the MP LINPACK source provided.
You can use the source to build a non-hybrid version that may be used in a hybrid mode, but it would be missing some of the optimizations added to the hybrid version.

Non-hybrid builds are the default of the source code makefiles provided. In some cases, the use of the hybrid mode is required for external reasons. If there is a choice, the non-hybrid code may be faster. To use the non-hybrid code in a hybrid mode, use the threaded version of Intel MKL BLAS, link with a thread-safe MPI, and call the MPI_Init_thread() function to indicate a need for MPI to be thread-safe.

Contents of the Intel® Optimized MP LINPACK Benchmark for Clusters

The Intel Optimized MP LINPACK Benchmark for Clusters (MP LINPACK Benchmark) includes the HPL 2.0 distribution in its entirety, as well as the modifications delivered in the files listed in the table below and located in the benchmarks\mp_linpack\ subdirectory of the Intel MKL directory.

NOTE Because MP LINPACK Benchmark includes the entire HPL 2.0 distribution, which provides a configuration for Linux* OS only, some Linux OS files remain in the directory.
Directory/File in benchmarks\mp_linpack\    Contents

testing\ptest\HPL_pdtest.c                  HPL 2.0 code modified to display captured DGEMM information in ASYOUGO2_DISPLAY if it was captured (for details, see New Features).
src\blas\HPL_dgemm.c                        HPL 2.0 code modified to capture DGEMM information, if desired, from ASYOUGO2_DISPLAY.
src\grid\HPL_grid_init.c                    HPL 2.0 code modified to do additional grid experiments originally not in HPL 2.0.
src\pgesv\HPL_pdgesvK2.c                    HPL 2.0 code modified to do ASYOUGO and ENDEARLY modifications.
src\pgesv\HPL_pdgesv0.c                     HPL 2.0 code modified to do ASYOUGO, ASYOUGO2, and ENDEARLY modifications.
testing\ptest\HPL.dat                       HPL 2.0 sample HPL.dat modified.
makes                                       All the makefiles in this directory have been rebuilt in the Windows OS distribution.
testing\ptimer\                             Some files in here have been modified in the Windows OS distribution.
testing\timer\                              Some files in here have been modified in the Windows OS distribution.
Make                                        (New) Sample architecture makefile for nmake utility to be used on processors based on the IA-32 and Intel® 64 architectures and Windows OS.
bin_intel\ia32\xhpl_ia32.exe                (New) Prebuilt binary for the IA-32 architecture, Windows OS, and Intel® MPI.
bin_intel\intel64\xhpl_intel64.exe          (New) Prebuilt binary for the Intel® 64 architecture, Windows OS, and Intel MPI.
lib_hybrid\ia32\libhpl_hybrid.lib           (New) Prebuilt library with the hybrid version of MP LINPACK for the IA-32 architecture and Intel MPI.
lib_hybrid\intel64\libhpl_hybrid.lib        (New) Prebuilt library with the hybrid version of MP LINPACK for the Intel® 64 architecture and Intel MPI.
bin_intel\ia32\xhpl_hybrid_ia32.exe         (New) Prebuilt hybrid binary for the IA-32 architecture, Windows OS, and Intel MPI.
bin_intel\intel64\xhpl_hybrid_intel64.exe   (New) Prebuilt hybrid binary for the Intel® 64 architecture, Windows OS, and Intel MPI.
nodeperf.c                                  (New) Sample utility that tests the DGEMM speed across the cluster.

See Also
High-level Directory Structure

Building the MP LINPACK

The MP LINPACK Benchmark contains a few sample architecture makefiles. You can edit them to fit your specific configuration. Specifically:

• Set TOPdir to the directory that MP LINPACK is being built in.
• Set MPI variables, that is, MPdir, MPinc, and MPlib.
• Specify the location of Intel MKL and of the files to be used (LAdir, LAinc, LAlib).
• Adjust compiler and compiler/linker options.
• Specify the version of MP LINPACK you are going to build (hybrid or non-hybrid) by setting the version parameter for the nmake command. For example:

nmake arch=intel64 mpi=intelmpi version=hybrid install

For some sample cases, the makefiles contain values that must be common. However, you need to be familiar with building an HPL and picking appropriate values for these variables.

New Features of Intel® Optimized MP LINPACK Benchmark

The toolset is basically identical with the HPL 2.0 distribution. There are a few changes that are optionally compiled in and disabled until you specifically request them. These new features are:

ASYOUGO: Provides non-intrusive performance information while runs proceed. There are only a few outputs and this information does not impact performance. This is especially useful because many runs can go for hours without any information.

ASYOUGO2: Provides slightly intrusive additional performance information by intercepting every DGEMM call.

ASYOUGO2_DISPLAY: Displays the performance of all the significant DGEMMs inside the run.

ENDEARLY: Displays a few performance hints and then terminates the run early.

FASTSWAP: Inserts the LAPACK-optimized DLASWP into HPL's code. You can experiment with this to determine best results.

HYBRID: Establishes the Hybrid OpenMP/MPI mode of MP LINPACK, providing the possibility to use threaded Intel MKL and prebuilt MP LINPACK hybrid libraries.
CAUTION Use this option only with an Intel compiler and the Intel® MPI library version 3.1 or higher. Using compiler version 10.0 or higher is also recommended.

Benchmarking a Cluster

To benchmark a cluster, follow the sequence of steps below (some of them are optional). Pay special attention to the iterative steps 3 and 4. They make a loop that searches for HPL parameters (specified in HPL.dat) that enable you to reach the top performance of your cluster.

1. Install HPL and make sure HPL is functional on all the nodes.

2. You may run nodeperf.c (included in the distribution) to see the performance of DGEMM on all the nodes. Compile nodeperf.c with your MPI and Intel MKL. For example:

icl /Za /O3 /w /D_WIN_ /I"\include" "\" "\lib\intel64\mkl_core.lib" "\lib\intel64\libiomp5md.lib" nodeperf.c

where the MPI library is msmpi.lib in the case of Microsoft* MPI and mpi.lib in the case of MPICH.

Launching nodeperf.c on all the nodes is especially helpful in a very large cluster. nodeperf enables quick identification of a potential problem spot without numerous small MP LINPACK runs around the cluster in search of the bad node. It goes through all the nodes, one at a time, and reports the performance of DGEMM followed by some host identifier. The higher the DGEMM performance, the faster that node was performing.

3. Edit HPL.dat to fit your cluster needs. Read through the HPL documentation for ideas on this. Note, however, that you should use at least 4 nodes.

4. Make an HPL run, using compile options such as ASYOUGO, ASYOUGO2, or ENDEARLY to aid in your search. These options enable you to gain insight into the performance sooner than HPL would normally give this insight. When doing so, follow these recommendations:

• Use MP LINPACK, which is a patched version of HPL, to save time in the search. All performance intrusive features are compile-optional in MP LINPACK.
That is, if you do not use the new options to reduce search time, these features are disabled. The primary purpose of the additions is to assist you in finding solutions. HPL requires a long time to search for many different parameters. In MP LINPACK, the goal is to get the best possible number. Given that the input is not fixed, there is a large parameter space you must search over. An exhaustive search of all possible inputs is impractically large even for a powerful cluster. MP LINPACK optionally prints information on performance as it proceeds. You can also terminate early.
• Save time by compiling with -DENDEARLY -DASYOUGO2 and using a negative threshold (do not use a negative threshold on the final run that you intend to submit as a Top500 entry). Set the threshold in line 13 of the HPL 2.0 input file HPL.dat.
• If you are going to run a problem to completion, do it with -DASYOUGO.

5. Using the quick performance feedback, return to step 3 and iterate until you are sure that the performance is as good as possible.

See Also
Options to Reduce Search Time

Options to Reduce Search Time

Running large problems to completion on large numbers of nodes can take many hours. The search space for MP LINPACK is also large: not only can you run any size problem, but you can also vary block sizes, grid layouts, lookahead steps, factorization methods, and so on. It can be a large waste of time to run a large problem to completion only to discover it ran 0.01% slower than your previous best problem.

Use the following options to reduce the search time:

• -DASYOUGO
• -DENDEARLY
• -DASYOUGO2

Use -DASYOUGO2 cautiously because it does have a marginal performance impact. To see DGEMM internal performance, compile with -DASYOUGO2 and -DASYOUGO2_DISPLAY. These options provide a lot of useful DGEMM performance information at the cost of around 0.2% performance loss.
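The negative-threshold setting recommended above for search runs can be scripted. The following is a minimal sketch, assuming that line 13 of HPL.dat begins with the numeric threshold followed by free-form descriptive text. The helper name negate_threshold is hypothetical and is not part of HPL or Intel MKL.

```python
def negate_threshold(hpl_dat_text, line_number=13):
    """Return HPL.dat text with the threshold on the given line made negative.

    Assumes the line starts with a number; anything after it is kept as-is.
    """
    lines = hpl_dat_text.splitlines()
    index = line_number - 1
    if index >= len(lines):
        raise ValueError("file has fewer than %d lines" % line_number)
    fields = lines[index].split(None, 1)
    if not fields:
        raise ValueError("line %d is empty" % line_number)
    value = float(fields[0])          # raises ValueError if not numeric
    fields[0] = repr(-abs(value))     # already-negative values stay negative
    lines[index] = " ".join(fields)
    return "\n".join(lines) + "\n"
```

Because the helper always writes a negative value, remember to restore a positive threshold before the final run that you intend to submit.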
If you want to use the old HPL, simply omit these options and recompile from scratch. To do this, try "nmake arch= clean_arch_all".

-DASYOUGO

-DASYOUGO gives performance data as the run proceeds. The performance always starts off higher and then drops because this is what actually happens in LU decomposition (a decomposition of a matrix into a product of lower (L) and upper (U) triangular matrices). The ASYOUGO performance estimate is usually an overestimate (because the LU decomposition slows down as it goes), but it gets more accurate as the problem proceeds. The greater the lookahead step, the less accurate the first number may be. ASYOUGO tries to estimate where one is in the LU decomposition that MP LINPACK performs, and this is always an overestimate as compared to ASYOUGO2, which measures actually achieved DGEMM performance. Note that the ASYOUGO output is a subset of the information that ASYOUGO2 provides. So, refer to the description of the -DASYOUGO2 option below for the details of the output.

-DENDEARLY

-DENDEARLY terminates the problem after a few steps, so that you can set up 10 or 20 HPL runs without monitoring them, see how they all do, and then only run the fastest ones to completion. -DENDEARLY assumes -DASYOUGO. You do not need to define both, although it doesn't hurt. To avoid the residual check for a problem that terminates early, set the "threshold" parameter in HPL.dat to a negative number when testing ENDEARLY. It also sometimes gives a better picture to compile with -DASYOUGO2 when using -DENDEARLY.

Usage notes on -DENDEARLY follow:

• -DENDEARLY stops the problem after a few iterations of DGEMM on the block size (the bigger the block size, the further it gets). It prints only 5 or 6 "updates", whereas -DASYOUGO prints about 46 or so output elements before the problem completes.
• Performance for -DASYOUGO and -DENDEARLY always starts off at one speed, slowly increases, and then slows down toward the end (because that is what LU does).
-DENDEARLY is likely to terminate before it starts to slow down.
• -DENDEARLY terminates the problem early with an HPL Error exit. It means that you need to ignore the missing residual results, which are wrong because the problem never completed. However, you can get an idea what the initial performance was, and if it looks good, then run the problem to completion without -DENDEARLY. To avoid the error check, you can set HPL's threshold parameter in HPL.dat to a negative number.
• Though -DENDEARLY terminates early, HPL treats the problem as completed and computes the Gflop rating as though the problem ran to completion. Ignore this erroneously high rating.
• The bigger the problem, the closer the last update that -DENDEARLY returns is to what happens when the problem runs to completion. -DENDEARLY is a poor approximation for small problems. For this reason, it is suggested that you use ENDEARLY in conjunction with ASYOUGO2, because ASYOUGO2 reports actual DGEMM performance, which can be a closer approximation to problems just starting.

-DASYOUGO2

-DASYOUGO2 gives detailed single-node DGEMM performance information. It captures all DGEMM calls (if you use Fortran BLAS) and records their data. Because of this, the routine has a marginal intrusive overhead. Unlike -DASYOUGO, which is quite non-intrusive, -DASYOUGO2 interrupts every DGEMM call to monitor its performance. Be aware of this overhead, although for big problems it is less than 0.1%.

Here is a sample ASYOUGO2 output (the first 3 non-intrusive numbers can be found in ASYOUGO and ENDEARLY, so it suffices to describe these numbers here):

Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78).

The problem size was N=16000 with a block size of 128. After 10 blocks, that is, 1280 columns, an output was sent to the screen. Here, the fraction of columns completed is 1280/16000=0.08.
Only up to 40 outputs are printed, at various places through the matrix decomposition: fractions 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 0.210 0.215 0.220 0.225 0.230 0.235 0.240 0.245 0.250 0.255 0.260 0.265 0.270 0.275 0.280 0.285 0.290 0.295 0.300 0.305 0.310 0.315 0.320 0.325 0.330 0.335 0.340 0.345 0.350 0.355 0.360 0.365 0.370 0.375 0.380 0.385 0.390 0.395 0.400 0.405 0.410 0.415 0.420 0.425 0.430 0.435 0.440 0.445 0.450 0.455 0.460 0.465 0.470 0.475 0.480 0.485 0.490 0.495 0.515 0.535 0.555 0.575 0.595 0.615 0.635 0.655 0.675 0.695 0.795 0.895.

However, this problem size is so small and the block size so big by comparison that as soon as it prints the value for 0.045, it was already through 0.08 fraction of the columns. On a really big problem, the fractional number will be more accurate. It never prints more than the 112 numbers above. So, smaller problems will have fewer than 112 updates, and the biggest problems will have precisely 112 updates.

Mflops is an estimate based on 1280 columns of LU being completed. However, with lookahead steps, sometimes that work is not actually completed when the output is made. Nevertheless, this is a good estimate for comparing identical runs.

The 3 numbers in parentheses are intrusive ASYOUGO2 add-ins. DT is the total time processor 0 has spent in DGEMM. DF is the number of billion operations that have been performed in DGEMM by one processor. Hence, the performance of processor 0 (in Gflops) in DGEMM is always DF/DT.
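To compare many runs, it can help to extract these fields automatically. The sketch below parses a line shaped like the sample ASYOUGO2 output and derives the DGEMM rate DF/DT and the fraction of columns completed. The regular expression is modeled only on that one sample line, and the function names are hypothetical, so adjust both if your output differs.

```python
import re

# Pattern modeled on the single sample line shown in this section (an assumption).
_ASYOUGO2_LINE = re.compile(
    r"Col=(?P<col>\d+)\s+Fract=(?P<fract>[\d.]+)\s+Mflops=(?P<mflops>[\d.]+)"
    r"\s+\(DT=(?P<dt>[\d.]+)\s+DF=(?P<df>[\d.]+)\s+DMF=(?P<dmf>[\d.]+)\)"
)

SAMPLE = "Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78)."


def parse_asyougo2(line):
    """Return the fields of one ASYOUGO2 progress line as a dict."""
    match = _ASYOUGO2_LINE.search(line)
    if match is None:
        raise ValueError("not an ASYOUGO2 progress line: %r" % line)
    record = {"col": int(match.group("col"))}
    for key in ("fract", "mflops", "dt", "df", "dmf"):
        record[key] = float(match.group(key))
    return record


def dgemm_gflops(record):
    """DGEMM rate of processor 0 in Gflops: DF (billions of ops) / DT (seconds)."""
    return record["df"] / record["dt"]


def columns_fraction(record, n):
    """Fraction of the N columns completed when the line was printed."""
    return record["col"] / n
```

For the sample line with N=16000, columns_fraction gives 1280/16000, and dgemm_gflops gives 34.1/9.5, which is about 3.6 Gflops for processor 0.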
Using the number of DGEMM flops as a basis instead of the number of LU flops, you get a lower bound on performance of the run by looking at DMF, which can be compared to Mflops above. (It uses the global LU time, but the DGEMM flops are computed under the assumption that the problem is evenly distributed amongst the nodes, as only HPL's node (0,0) returns any output.)

Note that when using the above performance monitoring tools to compare different HPL.dat input data sets, you should be aware that the pattern of performance drop-off that LU experiences is sensitive to some input data. For instance, when you try very small problems, the performance drop-off from the initial values to end values is very rapid. The larger the problem, the less the drop-off, and it is probably safe to use the first few performance values to estimate the difference between a problem size 700000 and 701000, for instance. Another factor that influences the performance drop-off is the grid dimensions (P and Q). For big problems, the performance tends to fall off less from the first few steps when P and Q are roughly equal in value. You can make use of a large number of parameters, such as broadcast types, and change them so that the final performance is determined very closely by the first few steps.

Using these tools will greatly increase the amount of data you can test.

See Also
Benchmarking a Cluster

Appendix A. Intel® Math Kernel Library Language Interfaces Support

Language Interfaces Support, by Function Domain

The following table shows language interfaces that Intel® Math Kernel Library (Intel® MKL) provides for each function domain. However, Intel MKL routines can be called from other languages using mixed-language programming. See Mixed-language Programming with Intel® MKL for an example of how to call Fortran routines from C/C++.
Function Domain    FORTRAN 77 interface    Fortran 90/95 interface    C/C++ interface

Basic Linear Algebra Subprograms (BLAS)    Yes    Yes    via CBLAS
BLAS-like extension transposition routines    Yes    Yes
Sparse BLAS Level 1    Yes    Yes    via CBLAS
Sparse BLAS Level 2 and 3    Yes    Yes    Yes
LAPACK routines for solving systems of linear equations    Yes    Yes    Yes
LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations    Yes    Yes    Yes
Auxiliary and utility LAPACK routines    Yes    Yes
Parallel Basic Linear Algebra Subprograms (PBLAS)    Yes
ScaLAPACK routines    Yes    †
DSS/PARDISO* solvers    Yes    Yes    Yes
Other Direct and Iterative Sparse Solver routines    Yes    Yes    Yes
Vector Mathematical Library (VML) functions    Yes    Yes    Yes
Vector Statistical Library (VSL) functions    Yes    Yes    Yes
Fourier Transform functions (FFT)    Yes    Yes
Cluster FFT functions    Yes    Yes
Trigonometric Transform routines    Yes    Yes
Fast Poisson, Laplace, and Helmholtz Solver (Poisson Library) routines    Yes    Yes
Optimization (Trust-Region) Solver routines    Yes    Yes    Yes
Data Fitting functions    Yes    Yes    Yes
GMP* arithmetic functions ††    Yes
Support functions (including memory allocation)    Yes    Yes    Yes

† Supported using a mixed language programming call. See Intel® MKL Include Files for the respective header file.
†† GMP Arithmetic Functions are deprecated and will be removed in a future release.
Include Files

Function domain                                   Fortran Include Files             C/C++ Include Files

All function domains                              mkl.fi                            mkl.h
BLAS Routines                                     blas.f90 mkl_blas.fi              mkl_blas.h
BLAS-like Extension Transposition Routines        mkl_trans.fi                      mkl_trans.h
CBLAS Interface to BLAS                                                             mkl_cblas.h
Sparse BLAS Routines                              mkl_spblas.fi                     mkl_spblas.h
LAPACK Routines                                   lapack.f90 mkl_lapack.fi          mkl_lapack.h
C Interface to LAPACK                                                               mkl_lapacke.h
ScaLAPACK Routines                                                                  mkl_scalapack.h
All Sparse Solver Routines                        mkl_solver.f90                    mkl_solver.h
PARDISO                                           mkl_pardiso.f77 mkl_pardiso.f90   mkl_pardiso.h
DSS Interface                                     mkl_dss.f77 mkl_dss.f90           mkl_dss.h
RCI Iterative Solvers, ILU Factorization          mkl_rci.fi                        mkl_rci.h
Optimization Solver Routines                      mkl_rci.fi                        mkl_rci.h
Vector Mathematical Functions                     mkl_vml.f77 mkl_vml.f90           mkl_vml.h
Vector Statistical Functions                      mkl_vsl.f77 mkl_vsl.f90           mkl_vsl_functions.h
Fourier Transform Functions                       mkl_dfti.f90                      mkl_dfti.h
Cluster Fourier Transform Functions               mkl_cdft.f90                      mkl_cdft.h
Partial Differential Equations Support Routines
  Trigonometric Transforms                        mkl_trig_transforms.f90           mkl_trig_transforms.h
  Poisson Solvers                                 mkl_poisson.f90                   mkl_poisson.h
Data Fitting functions                            mkl_df.f77 mkl_df.f90             mkl_df.h
GMP interface †                                                                     mkl_gmp.h
Support functions                                 mkl_service.f90 mkl_service.fi    mkl_service.h
Memory allocation routines                                                          i_malloc.h
Intel MKL examples interface                                                        mkl_example.h

† GMP Arithmetic Functions are deprecated and will be removed in a future release.

See Also
Language Interfaces Support, by Function Domain

Appendix B. Support for Third-Party Interfaces

GMP* Functions

Intel® Math Kernel Library (Intel® MKL) implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers.
The interfaces of such functions fully match the GNU Multiple Precision* (GMP) Arithmetic Library. For specifications of these functions, please see http://software.intel.com/sites/products/documentation/hpc/mkl/gnump/index.htm.

NOTE Intel MKL GMP Arithmetic Functions are deprecated and will be removed in a future release.

If you currently use the GMP* library, you need to modify INCLUDE statements in your programs to mkl_gmp.h.

FFTW Interface Support

Intel® Math Kernel Library (Intel® MKL) offers two collections of wrappers for the FFTW interface (www.fftw.org). The wrappers are the superstructure of FFTW to be used for calling the Intel MKL Fourier transform functions. These collections correspond to the FFTW versions 2.x and 3.x and the Intel MKL versions 7.0 and later.

These wrappers enable using Intel MKL Fourier transforms to improve the performance of programs that use FFTW without changing the program source code. See the "FFTW Interface to Intel® Math Kernel Library" appendix in the Intel MKL Reference Manual for details on the use of the wrappers.

Important For ease of use, FFTW3 interface is also integrated in Intel MKL.

Appendix C. Directory Structure in Detail

Tables in this section show contents of the Intel® Math Kernel Library (Intel® MKL) architecture-specific directories.

Detailed Structure of the IA-32 Architecture Directories

Static Libraries in the lib\ia32 Directory

File                              Contents

Interface layer
mkl_intel_c.lib                   cdecl interface library
mkl_intel_s.lib                   CVF default interface library
mkl_blas95.lib                    Fortran 95 interface library for BLAS. Supports the Intel® Fortran compiler
mkl_lapack95.lib                  Fortran 95 interface library for LAPACK. Supports the Intel® Fortran compiler

Threading layer
mkl_intel_thread.lib              Threading library for the Intel compilers
mkl_pgi_thread.lib                Threading library for the PGI* compiler
mkl_sequential.lib                Sequential library

Computational layer
mkl_core.lib                      Kernel library for IA-32 architecture
mkl_solver.lib                    Deprecated. Empty library for backward compatibility
mkl_solver_sequential.lib         Deprecated. Empty library for backward compatibility
mkl_scalapack_core.lib            ScaLAPACK routines
mkl_cdft_core.lib                 Cluster version of FFTs

Run-time Libraries (RTL)
mkl_blacs_intelmpi.lib            BLACS routines supporting Intel MPI
mkl_blacs_mpich2.lib              BLACS routines supporting MPICH2

Dynamic Libraries in the lib\ia32 Directory

File                              Contents

mkl_rt.lib                        Single Dynamic Library to be used for linking

Interface layer
mkl_intel_c_dll.lib               cdecl interface library for dynamic linking
mkl_intel_s_dll.lib               CVF default interface library for dynamic linking

Threading layer
mkl_intel_thread_dll.lib          Threading library for dynamic linking with the Intel compilers
mkl_pgi_thread_dll.lib            Threading library for dynamic linking with the PGI* compiler
mkl_sequential_dll.lib            Sequential library for dynamic linking

Computational layer
mkl_core_dll.lib                  Core library for dynamic linking
mkl_scalapack_core_dll.lib        ScaLAPACK routine library for dynamic linking
mkl_cdft_core_dll.lib             Cluster FFT library for dynamic linking

Run-time Libraries (RTL)
mkl_blacs_dll.lib                 BLACS interface library for dynamic linking

Contents of the redist\ia32\mkl Directory

File                              Contents

mkl_rt.dll                        Single Dynamic Library

Threading layer
mkl_intel_thread.dll              Dynamic threading library for the Intel compilers
mkl_pgi_thread.dll                Dynamic threading library for the PGI* compiler
mkl_sequential.dll                Dynamic sequential library

Computational layer
mkl_core.dll                      Core library containing processor-independent code and a dispatcher for dynamic loading of processor-specific code
mkl_def.dll                       Default kernel (Intel® Pentium®, Pentium® Pro, Pentium® II, and Pentium® III processors)
mkl_p4.dll                        Pentium® 4 processor kernel
mkl_p4p.dll                       Kernel for the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3), including Intel® Core™ Duo and Intel® Core™ Solo processors.
mkl_p4m.dll                       Kernel for processors based on the Intel® Core™ microarchitecture (except Intel® Core™ Duo and Intel® Core™ Solo processors, for which mkl_p4p.dll is intended)
mkl_p4m3.dll                      Kernel for the Intel® Core™ i7 processors
mkl_vml_def.dll                   VML/VSL part of default kernel for old Intel® Pentium® processors
mkl_vml_ia.dll                    VML/VSL default kernel for newer Intel® architecture processors
mkl_vml_p4.dll                    VML/VSL part of Pentium® 4 processor kernel
mkl_vml_p4p.dll                   VML/VSL for Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3)
mkl_vml_p4m.dll                   VML/VSL for processors based on the Intel® Core™ microarchitecture (except Intel® Core™ Duo and Intel® Core™ Solo processors, for which mkl_vml_p4p.dll is intended).
mkl_vml_p4m2.dll                  VML/VSL for 45nm Hi-k Intel® Core™2 and Intel Xeon® processor families
mkl_vml_p4m3.dll                  VML/VSL for the Intel® Core™ i7 processors
mkl_vml_avx.dll                   VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX)
mkl_scalapack_core.dll            ScaLAPACK routines
mkl_cdft_core.dll                 Cluster FFT dynamic library
libimalloc.dll                    Dynamic library to support renaming of memory functions

Run-time Libraries (RTL)
mkl_blacs.dll                     BLACS routines
mkl_blacs_intelmpi.dll            BLACS routines supporting Intel MPI
mkl_blacs_mpich2.dll              BLACS routines supporting MPICH2
1033\mkl_msg.dll                  Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English
1041\mkl_msg.dll                  Catalog of Intel MKL messages in Japanese. Available only if the Intel® MKL package provides Japanese localization. Please see the Release Notes for this information

Detailed Structure of the Intel® 64 Architecture Directories

Static Libraries in the lib\intel64 Directory

File                              Contents

Interface layer
mkl_intel_lp64.lib                LP64 interface library for the Intel compilers
mkl_intel_ilp64.lib               ILP64 interface library for the Intel compilers
mkl_intel_sp2dp.a                 SP2DP interface library for the Intel compilers
mkl_blas95_lp64.lib               Fortran 95 interface library for BLAS. Supports the Intel® Fortran compiler and LP64 interface
mkl_blas95_ilp64.lib              Fortran 95 interface library for BLAS. Supports the Intel® Fortran compiler and ILP64 interface
mkl_lapack95_lp64.lib             Fortran 95 interface library for LAPACK. Supports the Intel® Fortran compiler and LP64 interface
mkl_lapack95_ilp64.lib            Fortran 95 interface library for LAPACK. Supports the Intel® Fortran compiler and ILP64 interface

Threading layer
mkl_intel_thread.lib              Threading library for the Intel compilers
mkl_pgi_thread.lib                Threading library for the PGI* compiler
mkl_sequential.lib                Sequential library

Computational layer
mkl_core.lib                      Kernel library for the Intel® 64 architecture
mkl_solver_lp64.lib               Deprecated. Empty library for backward compatibility
mkl_solver_lp64_sequential.lib    Deprecated. Empty library for backward compatibility
mkl_solver_ilp64.lib              Deprecated. Empty library for backward compatibility
mkl_solver_ilp64_sequential.lib   Deprecated. Empty library for backward compatibility
mkl_scalapack_lp64.lib            ScaLAPACK routine library supporting the LP64 interface
mkl_scalapack_ilp64.lib           ScaLAPACK routine library supporting the ILP64 interface
mkl_cdft_core.lib                 Cluster version of FFTs

Run-time Libraries (RTL)
mkl_blacs_intelmpi_lp64.lib       LP64 version of BLACS routines supporting Intel MPI
mkl_blacs_intelmpi_ilp64.lib      ILP64 version of BLACS routines supporting Intel MPI
mkl_blacs_mpich2_lp64.lib         LP64 version of BLACS routines supporting MPICH2
mkl_blacs_mpich2_ilp64.lib        ILP64 version of BLACS routines supporting MPICH2
mkl_blacs_msmpi_lp64.lib          LP64 version of BLACS routines supporting Microsoft* MPI
mkl_blacs_msmpi_ilp64.lib         ILP64 version of BLACS routines supporting Microsoft* MPI

Dynamic Libraries in the lib\intel64 Directory

File                              Contents

mkl_rt.lib                        Single Dynamic Library to be used for linking

Interface layer
mkl_intel_lp64_dll.lib            LP64 interface library for dynamic linking with the Intel compilers
mkl_intel_ilp64_dll.lib           ILP64 interface library for dynamic linking with the Intel compilers

Threading layer
mkl_intel_thread_dll.lib          Threading library for dynamic linking with the Intel compilers
mkl_pgi_thread_dll.lib            Threading library for dynamic linking with the PGI* compiler
mkl_sequential_dll.lib            Sequential library for dynamic linking

Computational layer
mkl_core_dll.lib                  Core library for dynamic linking
mkl_scalapack_lp64_dll.lib        ScaLAPACK routine library for dynamic linking supporting the LP64 interface
mkl_scalapack_ilp64_dll.lib       ScaLAPACK routine library for dynamic linking supporting the ILP64 interface
mkl_cdft_core_dll.lib             Cluster FFT library for dynamic linking

Run-time Libraries (RTL)
mkl_blacs_lp64_dll.lib            LP64 version of BLACS interface library for dynamic linking
mkl_blacs_ilp64_dll.lib           ILP64 version of BLACS interface library for dynamic linking

Contents of the redist\intel64\mkl Directory

File                              Contents

mkl_rt.dll                        Single Dynamic Library

Threading layer
mkl_intel_thread.dll              Dynamic threading library for the Intel compilers
mkl_pgi_thread.dll                Dynamic threading library for the PGI* compiler
mkl_sequential.dll                Dynamic sequential library

Computational layer
mkl_core.dll                      Core library containing processor-independent code and a dispatcher for dynamic loading of processor-specific code
mkl_def.dll                       Default kernel for the Intel® 64 architecture
mkl_p4n.dll                       Kernel for the Intel® Xeon® processor using the Intel® 64 architecture
mkl_mc.dll                        Kernel for processors based on the Intel® Core™ microarchitecture
mkl_mc3.dll                       Kernel for the Intel® Core™ i7 processors
mkl_avx.dll                       Kernel optimized for the Intel® Advanced Vector Extensions (Intel® AVX).
mkl_vml_def.dll                   VML/VSL part of default kernel
mkl_vml_p4n.dll                   VML/VSL for the Intel® Xeon® processor using the Intel® 64 architecture
mkl_vml_mc.dll                    VML/VSL for processors based on the Intel® Core™ microarchitecture
mkl_vml_mc2.dll                   VML/VSL for 45nm Hi-k Intel® Core™2 and Intel® Xeon® processor families
mkl_vml_mc3.dll                   VML/VSL for the Intel® Core™ i7 processors
mkl_vml_avx.dll                   VML/VSL optimized for the Intel® Advanced Vector Extensions (Intel® AVX)
mkl_scalapack_lp64.dll            ScaLAPACK routine library supporting the LP64 interface
mkl_scalapack_ilp64.dll           ScaLAPACK routine library supporting the ILP64 interface
mkl_cdft_core.dll                 Cluster FFT dynamic library
libimalloc.dll                    Dynamic library to support renaming of memory functions

Run-time Libraries (RTL)
mkl_blacs_lp64.dll                LP64 version of BLACS routines
mkl_blacs_ilp64.dll               ILP64 version of BLACS routines
mkl_blacs_intelmpi_lp64.dll       LP64 version of BLACS routines supporting Intel MPI
mkl_blacs_intelmpi_ilp64.dll      ILP64 version of BLACS routines supporting Intel MPI
mkl_blacs_mpich2_lp64.dll         LP64 version of BLACS routines supporting MPICH2
mkl_blacs_mpich2_ilp64.dll        ILP64 version of BLACS routines supporting MPICH2
mkl_blacs_msmpi_lp64.dll          LP64 version of BLACS routines supporting Microsoft* MPI
mkl_blacs_msmpi_ilp64.dll         ILP64 version of BLACS routines supporting Microsoft* MPI
1033\mkl_msg.dll                  Catalog of Intel® Math Kernel Library (Intel® MKL) messages in English
1041\mkl_msg.dll                  Catalog of Intel MKL messages in Japanese. Available only if the Intel® MKL package provides Japanese localization.
Please see the Release Notes for this information.

Intel® Math Kernel Library Reference Manual
Document Number: 630813-045US
MKL 10.3 Update 8
Legal Information

Contents

Legal Information
Introducing the Intel® Math Kernel Library
Getting Help and Support
What's New
Notational Conventions

Chapter 1: Function Domains
  BLAS Routines
  Sparse BLAS Routines
  LAPACK Routines
  ScaLAPACK Routines
  PBLAS Routines
  Sparse Solver Routines
  VML Functions
  Statistical Functions
  Fourier Transform Functions
  Partial Differential Equations Support
  Nonlinear Optimization Problem Solvers
  Support Functions
  BLACS Routines
  Data Fitting Functions
  GMP Arithmetic Functions
  Performance Enhancements
  Parallelism
  C Datatypes Specific to Intel MKL

Chapter 2: BLAS and Sparse BLAS Routines
  BLAS Routines
    Routine Naming Conventions
    Fortran 95 Interface Conventions
    Matrix Storage Schemes
    BLAS Level 1 Routines and Functions
      ?asum
      ?axpy
      ?copy
      ?dot
      ?sdot
      ?dotc
      ?dotu
      ?nrm2
      ?rot
      ?rotg
      ?rotm
      ?rotmg
      ?scal
      ?swap
      i?amax
      i?amin
      ?cabs1
    BLAS Level 2 Routines
      ?gbmv
      ?gemv
      ?ger
      ?gerc
      ?geru
      ?hbmv
      ?hemv
      ?her
      ?her2
      ?hpmv
      ?hpr
      ?hpr2
      ?sbmv
      ?spmv
      ?spr
      ?spr2
      ?symv
      ?syr
      ?syr2
      ?tbmv
      ?tbsv
      ?tpmv
      ?tpsv
      ?trmv
      ?trsv
    BLAS Level 3 Routines
      ?gemm
      ?hemm
      ?herk
      ?her2k
      ?symm
      ?syrk
      ?syr2k
      ?trmm
      ?trsm
  Sparse BLAS Level 1 Routines
    Vector Arguments
    Naming Conventions
    Routines and Data Types
    BLAS Level 1 Routines That Can Work With Sparse Vectors
    ?axpyi
    ?doti
    ?dotci
    ?dotui
    ?gthr
    ?gthrz
    ?roti
    ?sctr
  Sparse BLAS Level 2 and Level 3 Routines
    Naming Conventions in Sparse BLAS Level 2 and Level 3
    Sparse Matrix Storage Formats
    Routines and Supported Operations
    Interface Consideration
    Sparse BLAS Level 2 and Level 3 Routines
      mkl_?csrgemv
      mkl_?bsrgemv
      mkl_?coogemv
      mkl_?diagemv
      mkl_?csrsymv
      mkl_?bsrsymv
      mkl_?coosymv
      mkl_?diasymv
      mkl_?csrtrsv
      mkl_?bsrtrsv
      mkl_?cootrsv
      mkl_?diatrsv
      mkl_cspblas_?csrgemv
      mkl_cspblas_?bsrgemv
      mkl_cspblas_?coogemv
      mkl_cspblas_?csrsymv
      mkl_cspblas_?bsrsymv
      mkl_cspblas_?coosymv
      mkl_cspblas_?csrtrsv
      mkl_cspblas_?bsrtrsv
      mkl_cspblas_?cootrsv
      mkl_?csrmv
      mkl_?bsrmv
      mkl_?cscmv
      mkl_?coomv
      mkl_?csrsv
      mkl_?bsrsv
      mkl_?cscsv
      mkl_?coosv
      mkl_?csrmm
      mkl_?bsrmm
      mkl_?cscmm
      mkl_?coomm
      mkl_?csrsm
      mkl_?cscsm
      mkl_?coosm
      mkl_?bsrsm
      mkl_?diamv
      mkl_?skymv
      mkl_?diasv
      mkl_?skysv
      mkl_?diamm
      mkl_?skymm
      mkl_?diasm
      mkl_?skysm
      mkl_?dnscsr
      mkl_?csrcoo
      mkl_?csrbsr
      mkl_?csrcsc
      mkl_?csrdia
      mkl_?csrsky
      mkl_?csradd
      mkl_?csrmultcsr
      mkl_?csrmultd
  BLAS-like Extensions
    ?axpby
    ?gem2vu
    ?gem2vc
    ?gemm3m
    mkl_?imatcopy
    mkl_?omatcopy
    mkl_?omatcopy2
    mkl_?omatadd

Chapter 3: LAPACK Routines: Linear Equations
  Routine Naming Conventions
  C Interface Conventions
  Fortran 95 Interface Conventions
    Intel® MKL Fortran 95 Interfaces for LAPACK Routines vs. Netlib Implementation
  Matrix Storage Schemes
  Mathematical Notation
  Error Analysis
  Computational Routines
    Routines for Matrix Factorization
      ?getrf
      ?gbtrf
      ?gttrf
      ?dttrfb
      ?potrf
      ?pstrf
      ?pftrf
      ?pptrf
      ?pbtrf
      ?pttrf
      ?sytrf
      ?hetrf
      ?sptrf
      ?hptrf
    Routines for Solving Systems of Linear Equations
      ?getrs
      ?gbtrs
      ?gttrs
      ?dttrsb
      ?potrs
      ?pftrs
      ?pptrs
      ?pbtrs
      ?pttrs
      ?sytrs
      ?hetrs
      ?sytrs2
      ?hetrs2
      ?sptrs
      ?hptrs
      ?trtrs
      ?tptrs
      ?tbtrs
    Routines for Estimating the Condition Number
      ?gecon
      ?gbcon
      ?gtcon
      ?pocon
      ?ppcon
      ?pbcon
      ?ptcon
      ?sycon
      ?syconv
      ?hecon
      ?spcon
      ?hpcon
      ?trcon
      ?tpcon
      ?tbcon
    Refining the Solution and Estimating Its Error
      ?gerfs
      ?gerfsx
      ?gbrfs
      ?gbrfsx
      ?gtrfs
      ?porfs
      ?porfsx
      ?pprfs
      ?pbrfs
      ?ptrfs
      ?syrfs
      ?syrfsx
      ?herfs
      ?herfsx
      ?sprfs
      ?hprfs
      ?trrfs
      ?tprfs
      ?tbrfs
    Routines for Matrix Inversion
      ?getri
      ?potri
      ?pftri
      ?pptri
      ?sytri
      ?hetri
      ?sytri2
      ?hetri2
      ?sytri2x
      ?hetri2x
      ?sptri
      ?hptri
      ?trtri
      ?tftri
      ?tptri
    Routines for Matrix Equilibration
      ?geequ
      ?geequb
      ?gbequ
      ?gbequb
      ?poequ
      ?poequb
      ?ppequ
      ?pbequ
      ?syequb
      ?heequb
  Driver Routines
    ?gesv
    ?gesvx
    ?gesvxx
    ?gbsv
    ?gbsvx
    ?gbsvxx
    ?gtsv
    ?gtsvx
    ?dtsvb
    ?posv
    ?posvx
    ?posvxx
    ?ppsv
    ?ppsvx
    ?pbsv
    ?pbsvx
    ?ptsv
    ?ptsvx
    ?sysv
    ?sysvx
    ?sysvxx
    ?hesv
    ?hesvx
    ?hesvxx
    ?spsv
    ?spsvx
    ?hpsv
    ?hpsvx
Chapter 4: LAPACK Routines: Least Squares and Eigenvalue Problems
  Routine Naming Conventions
  Matrix Storage Schemes
  Mathematical Notation
  Computational Routines
    Orthogonal Factorizations
      ?geqrf
      ?geqrfp
      ?geqpf
      ?geqp3
      ?orgqr
      ?ormqr
      ?ungqr
      ?unmqr
      ?gelqf
      ?orglq
      ?ormlq
      ?unglq
      ?unmlq
      ?geqlf
      ?orgql
      ?ungql
      ?ormql
      ?unmql
      ?gerqf
      ?orgrq
      ?ungrq
      ?ormrq
      ?unmrq
      ?tzrzf
      ?ormrz
      ?unmrz
      ?ggqrf
      ?ggrqf
    Singular Value Decomposition
      ?gebrd
      ?gbbrd
      ?orgbr
      ?ormbr
      ?ungbr
      ?unmbr
      ?bdsqr
      ?bdsdc
    Symmetric Eigenvalue Problems
      ?sytrd
      ?syrdb
      ?herdb
      ?orgtr
      ?ormtr
      ?hetrd
      ?ungtr
      ?unmtr
      ?sptrd
      ?opgtr
      ?opmtr
      ?hptrd
      ?upgtr
      ?upmtr
      ?sbtrd
      ?hbtrd
      ?sterf
      ?steqr
      ?stemr
      ?stedc
      ?stegr
      ?pteqr
      ?stebz
      ?stein
      ?disna
    Generalized Symmetric-Definite Eigenvalue Problems
      ?sygst
      ?hegst
      ?spgst
      ?hpgst
      ?sbgst
      ?hbgst
      ?pbstf
    Nonsymmetric Eigenvalue Problems
      ?gehrd
      ?orghr
      ?ormhr
      ?unghr
      ?unmhr
      ?gebal
      ?gebak
      ?hseqr
      ?hsein
      ?trevc
      ?trsna
      ?trexc
      ?trsen
      ?trsyl
    Generalized Nonsymmetric Eigenvalue Problems
      ?gghrd
      ?ggbal
      ?ggbak
      ?hgeqz
      ?tgevc
      ?tgexc
      ?tgsen
      ?tgsyl
      ?tgsna
    Generalized Singular Value Decomposition
      ?ggsvp
      ?tgsja
    Cosine-Sine Decomposition
      ?bbcsd
      ?orbdb/?unbdb
  Driver Routines
    Linear Least Squares (LLS) Problems
      ?gels
      ?gelsy
      ?gelss
      ?gelsd
    Generalized LLS Problems
      ?gglse
      ?ggglm
    Symmetric Eigenproblems
      ?syev
      ?heev
      ?syevd
      ?heevd
      ?syevx
      ?heevx
      ?syevr
      ?heevr
      ?spev
      ?hpev
      ?spevd
      ?hpevd
      ?spevx
      ?hpevx
      ?sbev
      ?hbev
      ?sbevd
      ?hbevd
      ?sbevx
      ?hbevx
      ?stev
      ?stevd
      ?stevx
      ?stevr
    Nonsymmetric Eigenproblems
      ?gees
      ?geesx
      ?geev
      ?geevx
    Singular Value Decomposition
      ?gesvd
      ?gesdd
      ?gejsv
      ?gesvj
      ?ggsvd
    Cosine-Sine Decomposition
      ?orcsd/?uncsd
    Generalized Symmetric Definite Eigenproblems
      ?sygv
      ?hegv
      ?sygvd
      ?hegvd
      ?sygvx
      ?hegvx
      ?spgv
      ?hpgv
      ?spgvd
      ?hpgvd
      ?spgvx
      ?hpgvx
      ?sbgv
      ?hbgv
      ?sbgvd
      ?hbgvd
      ?sbgvx
      ?hbgvx
    Generalized Nonsymmetric Eigenproblems
      ?gges
      ?ggesx
      ?ggev
      ?ggevx
Chapter 5: LAPACK Auxiliary and Utility Routines
  Auxiliary Routines
    ?lacgv
    ?lacrm
    ?lacrt
    ?laesy
    ?rot
    ?spmv
    ?spr
    ?symv
    ?syr
    i?max1
    ?sum1
    ?gbtf2
    ?gebd2
    ?gehd2
    ?gelq2
    ?geql2
    ?geqr2
    ?geqr2p
    ?gerq2
    ?gesc2
    ?getc2
    ?getf2
    ?gtts2
    ?isnan
    ?laisnan
    ?labrd
    ?lacn2
    ?lacon
    ?lacpy
    ?ladiv
    ?lae2
    ?laebz
    ?laed0
    ?laed1
    ?laed2
    ?laed3
    ?laed4
    ?laed5
    ?laed6
    ?laed7
    ?laed8
    ?laed9
    ?laeda
    ?laein
    ?laev2
    ?laexc
    ?lag2
    ?lags2
    ?lagtf
    ?lagtm
    ?lagts
    ?lagv2
    ?lahqr
    ?lahrd
    ?lahr2
    ?laic1
    ?laln2
    ?lals0
    ?lalsa
    ?lalsd
    ?lamrg
    ?laneg
    ?langb
    ?lange
    ?langt
    ?lanhs
    ?lansb
    ?lanhb
    ?lansp
    ?lanhp
    ?lanst/?lanht
    ?lansy
    ?lanhe
    ?lantb
    ?lantp
    ?lantr
    ?lanv2
    ?lapll
    ?lapmr
    ?lapmt
    ?lapy2
    ?lapy3
    ?laqgb
    ?laqge
    ?laqhb
    ?laqp2
    ?laqps
    ?laqr0
    ?laqr1
    ?laqr2
    ?laqr3
    ?laqr4
    ?laqr5
    ?laqsb
    ?laqsp
    ?laqsy
    ?laqtr
    ?lar1v
    ?lar2v
    ?larf
    ?larfb
    ?larfg
    ?larfgp
    ?larft
    ?larfx
    ?largv
    ?larnv
    ?larra
    ?larrb
    ?larrc
    ?larrd
    ?larre
    ?larrf
    ?larrj
    ?larrk
    ?larrr
    ?larrv
    ?lartg
    ?lartgp
    ?lartgs
    ?lartv
    ?laruv
    ?larz
    ?larzb
    ?larzt
    ?las2
    ?lascl
    ?lasd0
    ?lasd1
    ?lasd2
    ?lasd3
    ?lasd4
    ?lasd5
    ?lasd6
    ?lasd7
    ?lasd8
    ?lasd9
    ?lasda
    ?lasdq
    ?lasdt
    ?laset
    ?lasq1
    ?lasq2
    ?lasq3
    ?lasq4
    ?lasq5
    ?lasq6
    ?lasr
    ?lasrt
    ?lassq
    ?lasv2
?laswp
?lasy2
?lasyf
?lahef
?latbs
?latdf
?latps
?latrd
?latrs
?latrz
?lauu2
?lauum
?org2l/?ung2l
?org2r/?ung2r
?orgl2/?ungl2
?orgr2/?ungr2
?orm2l/?unm2l
?orm2r/?unm2r
?orml2/?unml2
?ormr2/?unmr2
?ormr3/?unmr3
?pbtf2
?potf2
?ptts2
?rscl
?syswapr
?heswapr
?sygs2/?hegs2
?sytd2/?hetd2
?sytf2
?hetf2
?tgex2
?tgsy2
?trti2
clag2z
dlag2s
slag2d
zlag2c
?larfp
ila?lc
ila?lr
?gsvj0
?gsvj1
?sfrk
?hfrk
?tfsm
?lansf
?lanhf
?tfttp
?tfttr
?tpttf
?tpttr
?trttf
?trttp
?pstf2
dlat2s
zlat2c
?lacp2
?la_gbamv
?la_gbrcond
?la_gbrcond_c
?la_gbrcond_x
?la_gbrfsx_extended
?la_gbrpvgrw
?la_geamv
?la_gercond
?la_gercond_c
?la_gercond_x
?la_gerfsx_extended
?la_heamv
?la_hercond_c
?la_hercond_x
?la_herfsx_extended
?la_herpvgrw
?la_lin_berr
?la_porcond
?la_porcond_c
?la_porcond_x
?la_porfsx_extended
?la_porpvgrw
?laqhe
?laqhp
?larcm
?la_rpvgrw
?larscl2
?lascl2
?la_syamv
?la_syrcond
?la_syrcond_c
?la_syrcond_x
?la_syrfsx_extended
?la_syrpvgrw
?la_wwaddw
Utility Functions and Routines
ilaver
ilaenv
iparmq
ieeeck
lsamen
?labad
?lamch
?lamc1
?lamc2
?lamc3
?lamc4
?lamc5
second/dsecnd
chla_transtype
iladiag
ilaprec
ilatrans
ilauplo
xerbla_array
Chapter 6: ScaLAPACK Routines
Overview
Routine Naming Conventions
Computational Routines
Linear Equations
Routines for Matrix Factorization
p?getrf
p?gbtrf
p?dbtrf
p?dttrf
p?potrf
p?pbtrf
p?pttrf
Routines for Solving Systems of Linear Equations
p?getrs
p?gbtrs
p?dbtrs
p?dttrs
p?potrs
p?pbtrs
p?pttrs
p?trtrs
Routines for Estimating the Condition Number
p?gecon
p?pocon
p?trcon
Refining the Solution and Estimating Its Error
p?gerfs
p?porfs
p?trrfs
Routines for Matrix Inversion
p?getri
p?potri
p?trtri
Routines for Matrix Equilibration
p?geequ
p?poequ
Orthogonal Factorizations
p?geqrf
p?geqpf
p?orgqr
p?ungqr
p?ormqr
p?unmqr
p?gelqf
p?orglq
p?unglq
p?ormlq
p?unmlq
p?geqlf
p?orgql
p?ungql
p?ormql
p?unmql
p?gerqf
p?orgrq
p?ungrq
p?ormrq
p?unmrq
p?tzrzf
p?ormrz
p?unmrz
p?ggqrf
p?ggrqf
Symmetric Eigenproblems
p?sytrd
p?ormtr
p?hetrd
p?unmtr
p?stebz
p?stein
Nonsymmetric Eigenvalue Problems
p?gehrd
p?ormhr
p?unmhr
p?lahqr
Singular Value Decomposition
p?gebrd
p?ormbr
p?unmbr
Generalized Symmetric-Definite Eigen Problems
p?sygst
p?hegst
Driver Routines
p?gesv
p?gesvx
p?gbsv
p?dbsv
p?dtsv
p?posv
p?posvx
p?pbsv
p?ptsv
p?gels
p?syev
p?syevd
p?syevx
p?heev
p?heevd
p?heevx
p?gesvd
p?sygvx
p?hegvx
Chapter 7: ScaLAPACK Auxiliary and Utility Routines
Auxiliary Routines
p?lacgv
p?max1
?combamax1
p?sum1
p?dbtrsv
p?dttrsv
p?gebd2
p?gehd2
p?gelq2
p?geql2
p?geqr2
p?gerq2
p?getf2
p?labrd
p?lacon
p?laconsb
p?lacp2
p?lacp3
p?lacpy
p?laevswp
p?lahrd
p?laiect
p?lange
p?lanhs
p?lansy, p?lanhe
p?lantr
p?lapiv
p?laqge
p?laqsy
p?lared1d
p?lared2d
p?larf
p?larfb
p?larfc
p?larfg
p?larft
p?larz
p?larzb
p?larzc
p?larzt
p?lascl
p?laset
p?lasmsub
p?lassq
p?laswp
p?latra
p?latrd
p?latrs
p?latrz
p?lauu2
p?lauum
p?lawil
p?org2l/p?ung2l
p?org2r/p?ung2r
p?orgl2/p?ungl2
p?orgr2/p?ungr2
p?orm2l/p?unm2l
p?orm2r/p?unm2r
p?orml2/p?unml2
p?ormr2/p?unmr2
p?pbtrsv
p?pttrsv
p?potf2
p?rscl
p?sygs2/p?hegs2
p?sytd2/p?hetd2
p?trti2
?lamsh
?laref
?lasorte
?lasrt2
?stein2
?dbtf2
?dbtrf
?dttrf
?dttrsv
?pttrsv
?steqr2
Utility Functions and Routines
p?labad
p?lachkieee
p?lamch
p?lasnbt
pxerbla
Chapter 8: Sparse Solver Routines
PARDISO* - Parallel Direct Sparse Solver Interface
pardiso
pardisoinit
pardiso_64
pardiso_getenv, pardiso_setenv
PARDISO Parameters in Tabular Form
Direct Sparse Solver (DSS) Interface Routines
DSS Interface Description
DSS Routines
dss_create
dss_define_structure
dss_reorder
dss_factor_real, dss_factor_complex
dss_solve_real, dss_solve_complex
dss_delete
dss_statistics
mkl_cvt_to_null_terminated_str
Implementation Details
Iterative Sparse Solvers based on Reverse Communication Interface (RCI ISS)
CG Interface Description
FGMRES Interface Description
RCI ISS Routines
dcg_init
dcg_check
dcg
dcg_get
dcgmrhs_init
dcgmrhs_check
dcgmrhs
dcgmrhs_get
dfgmres_init
dfgmres_check
dfgmres
dfgmres_get
Implementation Details
Preconditioners based on Incomplete LU Factorization Technique
ILU0 and ILUT Preconditioners Interface Description
dcsrilu0
dcsrilut
Calling Sparse Solver and Preconditioner Routines from C/C++
Chapter 9: Vector Mathematical Functions
Data Types, Accuracy Modes, and Performance Tips
Function Naming Conventions
Function Interfaces
VML Mathematical Functions
Pack Functions
Unpack Functions
    Service Functions ... 1972
    Input Parameters ... 1972
    Output Parameters ... 1973
Vector Indexing Methods ... 1973
Error Diagnostics ... 1973
VML Mathematical Functions ... 1974
  Special Value Notations ... 1976
  Arithmetic Functions ... 1976
    v?Add ... 1976
    v?Sub ... 1979
    v?Sqr ... 1981
    v?Mul ... 1983
    v?MulByConj ... 1986
    v?Conj ... 1987
    v?Abs ... 1989
    v?Arg ... 1991
    v?LinearFrac ... 1993
  Power and Root Functions ... 1995
    v?Inv ... 1995
    v?Div ... 1997
    v?Sqrt ... 2000
    v?InvSqrt ... 2002
    v?Cbrt ... 2004
    v?InvCbrt ... 2006
    v?Pow2o3 ... 2007
    v?Pow3o2 ... 2009
    v?Pow ... 2011
    v?Powx ... 2014
    v?Hypot ... 2017
  Exponential and Logarithmic Functions ... 2019
    v?Exp ... 2019
    v?Expm1 ... 2022
    v?Ln ... 2024
    v?Log10 ... 2027
    v?Log1p ... 2030
  Trigonometric Functions ... 2031
    v?Cos ... 2031
    v?Sin ... 2034
    v?SinCos ... 2036
    v?CIS ... 2038
    v?Tan ... 2040
    v?Acos ... 2042
    v?Asin ... 2045
    v?Atan ... 2047
    v?Atan2 ... 2050
  Hyperbolic Functions ... 2052
    v?Cosh ... 2052
    v?Sinh ... 2055
    v?Tanh ... 2058
    v?Acosh ... 2061
    v?Asinh ... 2064
    v?Atanh ... 2067
  Special Functions ... 2070
    v?Erf ... 2070
    v?Erfc ... 2073
    v?CdfNorm ... 2075
    v?ErfInv ... 2077
    v?ErfcInv ... 2080
    v?CdfNormInv ... 2082
    v?LGamma ... 2084
    v?TGamma ... 2086
  Rounding Functions ... 2088
    v?Floor ... 2088
    v?Ceil ... 2089
    v?Trunc ... 2091
    v?Round ... 2093
    v?NearbyInt ... 2094
    v?Rint ... 2096
    v?Modf ... 2098
VML Pack/Unpack Functions ... 2100
  v?Pack ... 2100
  v?Unpack ... 2103
VML Service Functions ... 2106
  vmlSetMode ... 2106
  vmlGetMode ... 2108
  vmlSetErrStatus ... 2109
  vmlGetErrStatus ... 2110
  vmlClearErrStatus ... 2111
  vmlSetErrorCallBack ... 2111
  vmlGetErrorCallBack ... 2114
  vmlClearErrorCallBack ... 2114

Chapter 10: Statistical Functions
Random Number Generators ... 2115
  Conventions ... 2116
    Mathematical Notation ... 2117
    Naming Conventions ... 2118
  Basic Generators ... 2121
    BRNG Parameter Definition ... 2122
    Random Streams ... 2123
    Data Types ... 2124
  Error Reporting ... 2124
  VSL RNG Usage Model ... 2125
  Service Routines ... 2127
    vslNewStream ... 2128
    vslNewStreamEx ... 2129
    vsliNewAbstractStream ... 2131
    vsldNewAbstractStream ... 2133
    vslsNewAbstractStream ... 2135
    vslDeleteStream ... 2137
    vslCopyStream ... 2138
    vslCopyStreamState ... 2139
    vslSaveStreamF ... 2140
    vslLoadStreamF ... 2141
    vslSaveStreamM ... 2142
    vslLoadStreamM ... 2144
    vslGetStreamSize ... 2145
    vslLeapfrogStream ... 2146
    vslSkipAheadStream ... 2148
    vslGetStreamStateBrng ... 2151
    vslGetNumRegBrngs ... 2152
  Distribution Generators ... 2153
    Continuous Distributions ... 2156
    Discrete Distributions ... 2189
  Advanced Service Routines ... 2208
    Data types ... 2208
    vslRegisterBrng ... 2209
    vslGetBrngProperties ... 2210
    Formats for User-Designed Generators ... 2211
Convolution and Correlation ... 2214
  Naming Conventions ... 2215
  Data Types ... 2215
  Parameters ... 2216
  Task Status and Error Reporting ... 2218
  Task Constructors ... 2220
    vslConvNewTask/vslCorrNewTask ... 2220
    vslConvNewTask1D/vslCorrNewTask1D ... 2223
    vslConvNewTaskX/vslCorrNewTaskX ... 2225
    vslConvNewTaskX1D/vslCorrNewTaskX1D ... 2228
  Task Editors ... 2232
    vslConvSetMode/vslCorrSetMode ... 2232
    vslConvSetInternalPrecision/vslCorrSetInternalPrecision ... 2234
    vslConvSetStart/vslCorrSetStart ... 2235
    vslConvSetDecimation/vslCorrSetDecimation ... 2237
  Task Execution Routines ... 2238
    vslConvExec/vslCorrExec ... 2239
    vslConvExec1D/vslCorrExec1D ... 2242
    vslConvExecX/vslCorrExecX ... 2246
    vslConvExecX1D/vslCorrExecX1D ... 2249
  Task Destructors ... 2253
    vslConvDeleteTask/vslCorrDeleteTask ... 2253
  Task Copy ... 2254
    vslConvCopyTask/vslCorrCopyTask ... 2254
  Usage Examples ... 2256
  Mathematical Notation and Definitions ... 2258
  Data Allocation ... 2259
VSL Summary Statistics ... 2261
  Naming Conventions ... 2262
  Data Types ... 2263
  Parameters ... 2263
  Task Status and Error Reporting ... 2263
  Task Constructors ... 2267
    vslSSNewTask ... 2267
  Task Editors ... 2269
    vslSSEditTask ... 2270
    vslSSEditMoments ... 2278
    vslSSEditCovCor ... 2280
    vslSSEditPartialCovCor ... 2282
    vslSSEditQuantiles ... 2284
    vslSSEditStreamQuantiles ... 2286
    vslSSEditPooledCovariance ... 2287
    vslSSEditRobustCovariance ... 2289
    vslSSEditOutliersDetection ... 2292
    vslSSEditMissingValues ... 2294
    vslSSEditCorParameterization ... 2298
  Task Computation Routines ... 2300
    vslSSCompute ... 2302
  Task Destructor ... 2303
    vslSSDeleteTask ... 2303
  Usage Examples ... 2304
  Mathematical Notation and Definitions ... 2305

Chapter 11: Fourier Transform Functions
FFT Functions ... 2312
  Computing an FFT ... 2313
  FFT Interface ... 2313
  Descriptor Manipulation Functions ... 2313
    DftiCreateDescriptor ... 2314
    DftiCommitDescriptor ... 2316
    DftiFreeDescriptor ... 2317
    DftiCopyDescriptor ... 2318
  FFT Computation Functions ... 2319
    DftiComputeForward ... 2320
    DftiComputeBackward ... 2322
  Descriptor Configuration Functions ... 2325
    DftiSetValue ... 2325
    DftiGetValue ... 2327
  Status Checking Functions ... 2329
    DftiErrorClass ... 2329
    DftiErrorMessage ... 2331
  Configuration Settings ... 2332
    DFTI_PRECISION ... 2334
    DFTI_FORWARD_DOMAIN ... 2335
    DFTI_DIMENSION, DFTI_LENGTHS ... 2336
    DFTI_PLACEMENT ... 2336
    DFTI_FORWARD_SCALE, DFTI_BACKWARD_SCALE ... 2336
    DFTI_NUMBER_OF_USER_THREADS ... 2336
    DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES ... 2337
    DFTI_NUMBER_OF_TRANSFORMS ... 2339
    DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE ... 2339
    DFTI_COMPLEX_STORAGE, DFTI_REAL_STORAGE, DFTI_CONJUGATE_EVEN_STORAGE ... 2340
    DFTI_PACKED_FORMAT ... 2347
    DFTI_WORKSPACE ... 2351
    DFTI_COMMIT_STATUS ... 2352
    DFTI_ORDERING ... 2352
Cluster FFT Functions ... 2352
  Computing Cluster FFT ... 2353
  Distributing Data among Processes ... 2354
  Cluster FFT Interface ... 2356
  Descriptor Manipulation Functions ... 2356
    DftiCreateDescriptorDM ... 2357
    DftiCommitDescriptorDM ... 2358
    DftiFreeDescriptorDM ... 2359
  FFT Computation Functions ... 2360
    DftiComputeForwardDM ... 2360
    DftiComputeBackwardDM ... 2362
  Descriptor Configuration Functions ... 2364
    DftiSetValueDM ... 2365
    DftiGetValueDM ... 2367
  Error Codes ... 2370

Chapter 12: PBLAS Routines
Overview ... 2373
Routine Naming Conventions ... 2374
PBLAS Level 1 Routines ... 2375
  p?amax ... 2376
  p?asum ... 2377
  p?axpy ... 2378
  p?copy ... 2379
  p?dot ... 2380
  p?dotc ... 2381
  p?dotu ... 2382
  p?nrm2 ... 2383
  p?scal ... 2384
  p?swap ... 2385
PBLAS Level 2 Routines ... 2386
  p?gemv ... 2387
  p?agemv ... 2389
  p?ger ... 2391
  p?gerc ... 2393
  p?geru ... 2394
  p?hemv ... 2396
  p?ahemv ... 2397
  p?her ... 2399
  p?her2 ... 2400
  p?symv ... 2402
  p?asymv ... 2404
  p?syr ... 2406
  p?syr2 ... 2407
  p?trmv ... 2409
  p?atrmv ... 2410
  p?trsv ... 2413
PBLAS Level 3 Routines ... 2414
  p?geadd ... 2415
  p?tradd ... 2416
  p?gemm ... 2418
  p?hemm ... 2420
  p?herk ... 2422
  p?her2k ... 2424
  p?symm ... 2426
  p?syrk ... 2428
  p?syr2k ... 2430
  p?tran ... 2432
  p?tranu ... 2433
  p?tranc ... 2434
  p?trmm ... 2435
  p?trsm ... 2437

Chapter 13: Partial Differential Equations Support
Trigonometric Transform Routines ... 2441
  Transforms Implemented ... 2442
  Sequence of Invoking TT Routines ... 2443
  Interface Description ... 2445
  TT Routines ... 2445
    ?_init_trig_transform ... 2445
    ?_commit_trig_transform ... 2446
    ?_forward_trig_transform ... 2448
    ?_backward_trig_transform ... 2450
    free_trig_transform ... 2451
  Common Parameters ... 2452
  Implementation Details ... 2455
Poisson Library Routines ... 2457
  Poisson Library Implemented ... 2457
  Sequence of Invoking PL Routines ... 2462
  Interface Description ... 2464
  PL Routines for the Cartesian Solver ... 2465
    ?_init_Helmholtz_2D/?_init_Helmholtz_3D ... 2465
    ?_commit_Helmholtz_2D/?_commit_Helmholtz_3D ... 2467
    ?_Helmholtz_2D/?_Helmholtz_3D ... 2470
    free_Helmholtz_2D/free_Helmholtz_3D ... 2474
  PL Routines for the Spherical Solver ... 2475
    ?_init_sph_p/?_init_sph_np ... 2475
    ?_commit_sph_p/?_commit_sph_np ... 2476
    ?_sph_p/?_sph_np ... 2478
    free_sph_p/free_sph_np ... 2480
  Common Parameters ... 2481
  Implementation Details ... 2486
Calling PDE Support Routines from Fortran 90 ... 2492

Chapter 14: Nonlinear Optimization Problem Solvers
Organization and Implementation ... 2495
Routine Naming Conventions ... 2496
Nonlinear Least Squares Problem without Constraints ... 2496
  ?trnlsp_init ... 2497
  ?trnlsp_check ... 2499
  ?trnlsp_solve ... 2500
  ?trnlsp_get ... 2502
  ?trnlsp_delete ... 2503
Nonlinear Least Squares Problem with Linear (Bound) Constraints ... 2504
  ?trnlspbc_init ... 2505
  ?trnlspbc_check ... 2506
  ?trnlspbc_solve ... 2508
  ?trnlspbc_get ... 2510
  ?trnlspbc_delete ... 2511
Jacobian Matrix Calculation Routines ... 2512
  ?jacobi_init ... 2512
  ?jacobi_solve ... 2513
  ?jacobi_delete ... 2514
  ?jacobi ... 2515
  ?jacobix ... 2516

Chapter 15: Support Functions
Version Information Functions ... 2521
  mkl_get_version ... 2521
  mkl_get_version_string ... 2523
Threading Control Functions ... 2524
  mkl_set_num_threads ... 2524
  mkl_domain_set_num_threads ... 2525
  mkl_set_dynamic ... 2526
  mkl_get_max_threads ... 2526
  mkl_domain_get_max_threads ... 2527
  mkl_get_dynamic ... 2528
Error Handling Functions ... 2528
  xerbla ... 2529
  pxerbla ... 2530
Equality Test Functions ... 2530
  lsame ... 2530
  lsamen ... 2531
Timing Functions ... 2532
  second/dsecnd ... 2532
  mkl_get_cpu_clocks ... 2533
  mkl_get_cpu_frequency ... 2534
  mkl_get_max_cpu_frequency ... 2534
  mkl_get_clocks_frequency ... 2535
Memory Functions ... 2536
  mkl_free_buffers ... 2536
  mkl_thread_free_buffers ... 2537
  mkl_disable_fast_mm ... 2538
  mkl_mem_stat ... 2538
  mkl_malloc ... 2539
  mkl_free ... 2540
  Examples of mkl_malloc(), mkl_free(), mkl_mem_stat() Usage ... 2540
Miscellaneous Utility Functions ... 2542
  mkl_progress ... 2542
  mkl_enable_instructions ... 2544
Functions Supporting the Single Dynamic Library ... 2545
    mkl_set_interface_layer
    mkl_set_threading_layer
    mkl_set_xerbla
    mkl_set_progress

Chapter 16: BLACS Routines
Matrix Shapes
BLACS Combine Operations
    ?gamx2d
    ?gamn2d
    ?gsum2d
BLACS Point To Point Communication
    ?gesd2d
    ?trsd2d
    ?gerv2d
    ?trrv2d
BLACS Broadcast Routines
    ?gebs2d
    ?trbs2d
    ?gebr2d
    ?trbr2d
BLACS Support Routines
    Initialization Routines
        blacs_pinfo
        blacs_setup
        blacs_get
        blacs_set
        blacs_gridinit
        blacs_gridmap
    Destruction Routines
        blacs_freebuff
        blacs_gridexit
        blacs_abort
        blacs_exit
    Informational Routines
        blacs_gridinfo
        blacs_pnum
        blacs_pcoord
    Miscellaneous Routines
        blacs_barrier
Examples of BLACS Routines Usage

Chapter 17: Data Fitting Functions
Naming Conventions
Data Types
Mathematical Conventions
Data Fitting Usage Model
Data Fitting Usage Examples
Task Status and Error Reporting
Task Creation and Initialization Routines
    df?newtask1d
Task Editors
    df?editppspline1d
    df?editptr
    dfieditval
    df?editidxptr
Computational Routines
    df?construct1d
    df?interpolate1d/df?interpolateex1d
    df?integrate1d/df?integrateex1d
    df?searchcells1d/df?searchcellsex1d
    df?interpcallback
    df?integrcallback
    df?searchcellscallback
Task Destructors
    dfdeletetask

Appendix A: Linear Solvers Basics
Sparse Linear Systems
    Matrix Fundamentals
    Direct Method
    Sparse Matrix Storage Formats

Appendix B: Routine and Function Arguments
Vector Arguments in BLAS
Vector Arguments in VML
Matrix Arguments

Appendix C: Code Examples
BLAS Code Examples
Fourier Transform Functions Code Examples
    FFT Code Examples
        Examples of Using Multi-Threading for FFT Computation
    Examples for Cluster FFT Functions
    Auxiliary Data Transformations

Appendix D: CBLAS Interface to the BLAS
CBLAS Arguments
Level 1 CBLAS
Level 2 CBLAS
Level 3 CBLAS
Sparse CBLAS

Appendix E: Specific Features of Fortran 95 Interfaces for LAPACK Routines
Interfaces Identical to Netlib
Interfaces with Replaced Argument Names
Modified Netlib Interfaces
Interfaces Absent From Netlib
Interfaces of New Functionality

Appendix F: FFTW Interface to Intel® Math Kernel Library
Notational Conventions
FFTW2 Interface to Intel® Math Kernel Library
    Wrappers Reference
        One-dimensional Complex-to-complex FFTs
        Multi-dimensional Complex-to-complex FFTs
        One-dimensional Real-to-half-complex/Half-complex-to-real FFTs
        Multi-dimensional Real-to-complex/Complex-to-real FFTs
        Multi-threaded FFTW
        FFTW Support Functions
        Limitations of the FFTW2 Interface to Intel MKL
    Calling Wrappers from Fortran
    Installation
        Creating the Wrapper Library
        Application Assembling
        Running Examples
    MPI FFTW Wrappers
        MPI FFTW Wrappers Reference
        Creating MPI FFTW Wrapper Library
        Application Assembling with MPI FFTW Wrapper Library
        Running Examples
FFTW3 Interface to Intel® Math Kernel Library
    Using FFTW3 Wrappers
    Calling Wrappers from Fortran
    Building Your Own Wrapper Library
    Building an Application
    Running Examples
    MPI FFTW Wrappers
        Building Your Own Wrapper Library
        Building an Application
        Running Examples

Appendix G: Bibliography

Appendix H: Glossary

Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number/

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.
You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java is a registered trademark of Oracle and/or its affiliates.

Third Party Content

Intel® Math Kernel Library (Intel® MKL) includes content from several 3rd party sources that was originally governed by the licenses referenced below:
• Portions © Copyright 2001 Hewlett-Packard Development Company, L.P.
• Sections on the Linear Algebra PACKage (LAPACK) routines include derivative work portions that have been copyrighted: © 1991, 1992, and 1998 by The Numerical Algorithms Group, Ltd.
• Intel MKL fully supports LAPACK 3.3 set of computational, driver, auxiliary and utility routines under the following license:

Copyright © 1992-2010 The University of Tennessee. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
    • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
    • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.
    • Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The original versions of LAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/lapack/index.html. The authors of LAPACK are E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.
• The original versions of the Basic Linear Algebra Subprograms (BLAS) from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/blas/index.html.
• The original versions of the Basic Linear Algebra Communication Subprograms (BLACS) from which the respective part of Intel MKL was derived can be obtained from http://www.netlib.org/blacs/index.html. The authors of BLACS are Jack Dongarra and R. Clint Whaley.
• The original versions of Scalable LAPACK (ScaLAPACK) from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/scalapack/index.html. The authors of ScaLAPACK are L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley.
• The original versions of the Parallel Basic Linear Algebra Subprograms (PBLAS) routines from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/scalapack/html/pblas_qref.html.
• PARDISO (PARallel DIrect SOlver)* in Intel® MKL is compliant with the 3.2 release of PARDISO that is freely distributed by the University of Basel. It can be obtained at http://www.pardiso-project.org.
• Some Fast Fourier Transform (FFT) functions in this release of Intel® MKL have been generated by the SPIRAL software generation system (http://www.spiral.net/) under license from Carnegie Mellon University. The authors of SPIRAL are Markus Puschel, Jose Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo.

Copyright © 1994-2011, Intel Corporation. All rights reserved.

Introducing the Intel® Math Kernel Library

The Intel® Math Kernel Library (Intel® MKL) improves performance of scientific, engineering, and financial software that solves large computational problems.
Among other functionality, Intel MKL provides linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores (see the Intel® MKL Release Notes for the full list of supported processors). Intel MKL also performs well on non-Intel processors.

Intel MKL is thread-safe and extensively threaded using the OpenMP* technology.

For more details about functionality provided by Intel MKL, see the Function Domains section.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Getting Help and Support

Getting Help

The online version of the Intel® Math Kernel Library (Intel® MKL) Reference Manual integrates into the Microsoft Visual Studio* development system help on Windows* OS or into the Eclipse* development system help on Linux* OS. For information on how to use the online help, see the Intel MKL User's Guide.

Getting Technical Support

Intel MKL provides a product web site that offers timely and comprehensive product information, including product features, white papers, and technical articles.
For the latest information, check: http://www.intel.com/software/products/support.

Intel also provides a support web site that contains a rich repository of self-help information, including getting started tips, known product issues, product errata, license information, user forums, and more (visit http://www.intel.com/software/products/).

Registering your product entitles you to one year of technical support and product updates through Intel® Premier Support. Intel Premier Support is an interactive issue management and communication web site providing these services:
• Submit issues and review their status.
• Download product updates at any time of day.

To register your product, contact Intel, or seek product support, please visit http://www.intel.com/software/products/support.

What's New

This Reference Manual documents the Intel® Math Kernel Library (Intel® MKL) 10.3 Update 8 release.

The following function domains were updated in Intel MKL 10.3 Update 8 with new functions, enhancements to the existing functionality, or improvements to the existing documentation:
• New data fitting functions provide spline-based interpolation capabilities that you can use to approximate functions, function derivatives or function integrals, and perform cell search operations. See Data Fitting Functions.
• The Fourier transform documentation has been updated and improved, especially in the descriptions of configuration settings that define the forward domain of the transform (see DFTI_FORWARD_DOMAIN), memory layout of the input/output data (see DFTI_INPUT_STRIDES, DFTI_OUTPUT_STRIDES), distances between consecutive data sets for computing multiple transforms (see DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE), and storage schemes (see DFTI_COMPLEX_STORAGE, DFTI_REAL_STORAGE).

Additionally, several minor updates have been made to correct errors in the manual.
Notational Conventions

This manual uses the following terms to refer to operating systems:

Windows* OS
    This term refers to information that is valid on all supported Windows* operating systems.
Linux* OS
    This term refers to information that is valid on all supported Linux* operating systems.
Mac OS* X
    This term refers to information that is valid on Intel®-based systems running the Mac OS* X operating system.

This manual uses the following notational conventions:
• Routine name shorthand (for example, ?ungqr instead of cungqr/zungqr).
• Font conventions used for distinction between the text and the code.

Routine Name Shorthand

For shorthand, names that contain a question mark "?" represent groups of routines with similar functionality. Each group typically consists of routines used with four basic data types: single-precision real, double-precision real, single-precision complex, and double-precision complex. The question mark is used to indicate any or all possible varieties of a function; for example:

?swap
    Refers to all four data types of the vector-vector ?swap routine: sswap, dswap, cswap, and zswap.

Font Conventions

The following font conventions are used:

UPPERCASE COURIER
    Data type used in the description of input and output parameters for Fortran interface. For example, CHARACTER*1.
lowercase courier
    Code examples: a(k+i,j) = matrix(i,j) and data types for C interface, for example, const float*
lowercase courier mixed with UpperCase courier
    Function names for C interface, for example, vmlSetMode
lowercase courier italic
    Variables in arguments and parameters description. For example, incx.
*
    Used as a multiplication symbol in code examples and equations and where required by the Fortran syntax.
Function Domains

The Intel® Math Kernel Library includes Fortran routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. In addition to the Fortran interface, Intel MKL includes a C-language interface for the Discrete Fourier transform functions, as well as for the Vector Mathematical Library and Vector Statistical Library functions. For hardware and software requirements to use Intel MKL, see Intel® MKL Release Notes.

The Intel® Math Kernel Library includes the following groups of routines:
• Basic Linear Algebra Subprograms (BLAS):
    – vector operations
    – matrix-vector operations
    – matrix-matrix operations
• Sparse BLAS Level 1, 2, and 3 (basic operations on sparse vectors and matrices)
• LAPACK routines for solving systems of linear equations
• LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations
• Auxiliary and utility LAPACK routines
• ScaLAPACK computational, driver and auxiliary routines (only in Intel MKL for Linux* and Windows* operating systems)
• PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operations
• Direct and Iterative Sparse Solver routines
• Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces)
• Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations
• General Fast Fourier Transform (FFT) Functions, providing fast computation of Discrete Fourier Transform via the FFT algorithms and having Fortran and C interfaces
• Cluster FFT functions (only in Intel MKL for Linux* and Windows* operating systems)
• Tools for solving partial differential equations: trigonometric transform routines and Poisson solver
• Optimization Solver routines for solving nonlinear least squares problems through the Trust-Region (TR) algorithms and computing Jacobi matrix by central differences
• Basic Linear Algebra Communication Subprograms (BLACS) that are used to support a linear algebra oriented message passing interface
• Data Fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search
• GMP arithmetic functions

For specific issues on using the library, also see the Intel® MKL Release Notes.

BLAS Routines

The BLAS routines and functions are divided into the following groups according to the operations they perform:
• BLAS Level 1 Routines perform operations of both addition and reduction on vectors of data. Typical operations include scaling and dot products.
• BLAS Level 2 Routines perform matrix-vector operations, such as matrix-vector multiplication, rank-1 and rank-2 matrix updates, and solution of triangular systems.
• BLAS Level 3 Routines perform matrix-matrix operations, such as matrix-matrix multiplication, rank-k update, and solution of triangular systems.

Starting from release 8.0, Intel® MKL also supports the Fortran 95 interface to the BLAS routines.
Starting from release 10.1, a number of BLAS-like Extensions are added to enable the user to perform certain data manipulation, including matrix in-place and out-of-place transposition operations combined with simple matrix arithmetic operations.

Sparse BLAS Routines

The Sparse BLAS Level 1 routines and functions and the Sparse BLAS Level 2 and Level 3 routines operate on sparse vectors and matrices. These routines perform operations similar to the BLAS Level 1, 2, and 3 routines. The Sparse BLAS routines take advantage of vector and matrix sparsity: they allow you to store only non-zero elements of vectors and matrices. Intel MKL also supports the Fortran 95 interface to Sparse BLAS routines.

LAPACK Routines

The Intel® Math Kernel Library fully supports LAPACK 3.1 set of computational, driver, auxiliary and utility routines.

The original versions of LAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/lapack/index.html. The authors of LAPACK are E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.

The LAPACK routines can be divided into the following groups according to the operations they perform:
• Routines for solving systems of linear equations, factoring and inverting matrices, and estimating condition numbers (see Chapter 3).
• Routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations (see Chapter 4).
• Auxiliary and utility routines used to perform certain subtasks, common low-level computation or related tasks (see Chapter 5).

Starting from release 8.0, Intel MKL also supports the Fortran 95 interface to LAPACK computational and driver routines. This interface provides an opportunity for simplified calls of LAPACK routines with fewer required arguments.
ScaLAPACK Routines

The ScaLAPACK package (included only with the Intel® MKL versions for Linux* and Windows* operating systems, see Chapter 6 and Chapter 7) runs on distributed-memory architectures and includes routines for solving systems of linear equations, solving linear least squares problems, eigenvalue and singular value problems, as well as performing a number of related computational tasks.

The original versions of ScaLAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/scalapack/index.html. The authors of ScaLAPACK are L. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley.

The Intel MKL version of ScaLAPACK is optimized for Intel® processors and uses the MPICH version of MPI as well as Intel MPI.

PBLAS Routines

The PBLAS routines perform operations with distributed vectors and matrices.
• PBLAS Level 1 Routines perform operations of both addition and reduction on vectors of data. Typical operations include scaling and dot products.
• PBLAS Level 2 Routines perform distributed matrix-vector operations, such as matrix-vector multiplication, rank-1 and rank-2 matrix updates, and solution of triangular systems.
• PBLAS Level 3 Routines perform distributed matrix-matrix operations, such as matrix-matrix multiplication, rank-k update, and solution of triangular systems.

Intel MKL provides the PBLAS routines with an interface similar to the interface used in the Netlib PBLAS (part of the ScaLAPACK package, see http://www.netlib.org/scalapack/html/pblas_qref.html).

Sparse Solver Routines

Direct sparse solver routines in Intel MKL (see Chapter 8) solve symmetric and symmetrically-structured sparse matrices with real or complex coefficients. For symmetric matrices, these Intel MKL subroutines can solve both positive-definite and indefinite systems. Intel MKL includes the PARDISO* sparse solver interface as well as an alternative set of user callable direct sparse solver routines.

If you use the sparse solver PARDISO* from Intel MKL, please cite:

O. Schenk and K. Gartner. Solving unsymmetric sparse systems of linear equations with PARDISO. J. of Future Generation Computer Systems, 20(3):475-487, 2004.
Intel MKL also provides an iterative sparse solver (see Chapter 8) that uses Sparse BLAS level 2 and 3 routines and works with different sparse data formats.

VML Functions

The Vector Mathematical Library (VML) functions (see Chapter 9) include a set of highly optimized implementations of certain computationally expensive core mathematical functions (power, trigonometric, exponential, hyperbolic, etc.) that operate on vectors of real and complex numbers.

Application programs whose performance might improve significantly with VML include nonlinear programming software, software for computing integrals, and many others. VML provides interfaces for both the Fortran and C languages.

Statistical Functions

The Vector Statistical Library (VSL) contains three sets of functions (see Chapter 10):

• The first set includes a collection of pseudo- and quasi-random number generator subroutines implementing basic continuous and discrete distributions. To provide best performance, the VSL subroutines use calls to highly optimized Basic Random Number Generators (BRNGs) and a library of vector mathematical functions.
• The second set includes a collection of routines that implement a wide variety of convolution and correlation operations.
• The third set includes a collection of routines for initial statistical analysis of raw single and double precision multi-dimensional datasets.

Fourier Transform Functions

The Intel® MKL multidimensional Fast Fourier Transform (FFT) functions with mixed radix support (see Chapter 11) provide uniformity of discrete Fourier transform computation and combine functionality with ease of use. Both Fortran and C interface specifications are given. There is also a cluster version of the FFT functions, which runs on distributed-memory architectures and is provided only in Intel MKL versions for the Linux* and Windows* operating systems.

The FFT functions provide fast computation via the FFT algorithms for arbitrary lengths. See the Intel® MKL User's Guide for the specific radices supported.

Partial Differential Equations Support

Intel® MKL provides tools for solving Partial Differential Equations (PDE) (see Chapter 13). These tools are the Trigonometric Transform interface routines and the Poisson Library.
The Trigonometric Transform routines may be helpful to users who implement their own solvers similar to the solver that the Poisson Library provides. Users can improve the performance of their solvers by using the fast sine, cosine, and staggered cosine transforms implemented in the Trigonometric Transform interface.

The Poisson Library is designed for fast solving of simple Helmholtz, Poisson, and Laplace problems. The Trigonometric Transform interface, which underlies the solver, is based on the Intel MKL FFT interface (refer to Chapter 11), optimized for Intel® processors.

Nonlinear Optimization Problem Solvers

Intel® MKL provides Nonlinear Optimization Problem Solver routines (see Chapter 14) that can be used to solve nonlinear least squares problems with or without linear (bound) constraints through the Trust-Region (TR) algorithms and to compute the Jacobi matrix by central differences.
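To make the phrase "compute the Jacobi matrix by central differences" concrete, the following C sketch approximates each column of the matrix from two evaluations of the residual function. It illustrates the general technique only and is not the Intel MKL routine: the callback type residual_fn, the function jacobi_central, the step h, and the sample function example_residuals are all assumptions introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type (not an MKL type): evaluates the m residuals
 * F(x) at the n-vector x and stores them in f. */
typedef void (*residual_fn)(size_t m, size_t n, const double *x, double *f);

/* Central-difference approximation of the m-by-n Jacobi matrix:
 *   J(i,j) ~= (F_i(x + h*e_j) - F_i(x - h*e_j)) / (2*h)
 * The result is stored row-major in jac. fplus and fminus are
 * caller-supplied work arrays of length m. The vector x is perturbed
 * one coordinate at a time and restored before returning. */
static void jacobi_central(residual_fn fn, size_t m, size_t n,
                           double *x, double h, double *jac,
                           double *fplus, double *fminus)
{
    assert(h > 0.0);
    for (size_t j = 0; j < n; ++j) {
        const double xj = x[j];
        x[j] = xj + h;
        fn(m, n, x, fplus);
        x[j] = xj - h;
        fn(m, n, x, fminus);
        x[j] = xj;                      /* restore the original coordinate */
        for (size_t i = 0; i < m; ++i)
            jac[i * n + j] = (fplus[i] - fminus[i]) / (2.0 * h);
    }
}

/* Sample residuals chosen for illustration only:
 *   F(x) = (x0^2 + x1, x0*x1, x1^3) */
static void example_residuals(size_t m, size_t n, const double *x, double *f)
{
    assert(m == 3 && n == 2);
    f[0] = x[0] * x[0] + x[1];
    f[1] = x[0] * x[1];
    f[2] = x[1] * x[1] * x[1];
}
```

Calling jacobi_central(example_residuals, 3, 2, x, 1e-5, jac, fplus, fminus) with x = (1, 2) fills jac with an approximation of the analytic Jacobi matrix, whose rows are (2, 1), (2, 1), and (0, 12). The central scheme costs two function evaluations per column, but its truncation error shrinks with the square of h.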
Support Functions

The Intel® MKL support functions (see Chapter 15) are used to support the operation of the Intel MKL software and provide basic information on the library and library operation, such as the current library version, timing, setting and measuring of CPU frequency, error handling, and memory allocation.

Starting from release 10.0, the Intel MKL support functions provide additional threading control.

Starting from release 10.1, Intel MKL selectively supports a Progress Routine feature to track progress of a lengthy computation and/or interrupt the computation using a callback function mechanism. The user application can define a function called mkl_progress that is regularly called from the Intel MKL routine supporting the progress routine feature. See the Progress Routines section in Chapter 15 for reference. Refer to a specific LAPACK or DSS/PARDISO function description to see whether the function supports this feature or not.

BLACS Routines

The Intel® Math Kernel Library implements routines from the BLACS (Basic Linear Algebra Communication Subprograms) package (see Chapter 16) that are used to support a linear algebra oriented message passing interface that may be implemented efficiently and uniformly across a large range of distributed memory platforms.

The original versions of BLACS from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/blacs/index.html. The authors of BLACS are Jack Dongarra and R. Clint Whaley.

Data Fitting Functions

The Data Fitting component includes a set of highly-optimized implementations of algorithms for the following spline-based computations:
• spline construction
• interpolation including computation of derivatives and integration
• search

The algorithms operate on single- and double-precision vector-valued functions set in the points of the given partition. You can use Data Fitting algorithms in applications that are based on data approximation.

GMP Arithmetic Functions

Intel® MKL implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers. The interfaces of such functions fully match the GNU Multiple Precision (GMP*) Arithmetic Library.

NOTE GMP Arithmetic Functions are deprecated and will be removed in a future Intel MKL release.

Performance Enhancements

The Intel® Math Kernel Library has been optimized by exploiting both processor and system features and capabilities. Special care has been given to those routines that most profit from cache-management techniques. These especially include matrix-matrix operation routines such as dgemm(). In addition, code optimization techniques have been applied to minimize dependencies of scheduling integer and floating-point units on the results within the processor.
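As a rough illustration of the cache-management idea mentioned above (working on blocks of data so that they are reused while still in cache), the following C sketch compares a plain triple loop with a tiled version of the same matrix-matrix product. It is not how Intel MKL implements dgemm(); the function names matmul_naive and matmul_blocked and the tile-size parameter bs are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: C += A*B for n-by-n row-major matrices. */
static void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* Blocked version: the same arithmetic, reorganized so that each
 * bs-by-bs tile of A, B, and C is reused several times before the
 * loops move on to the next tile. */
static void matmul_blocked(size_t n, size_t bs,
                           const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        const size_t iend = min_size(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            const size_t kend = min_size(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                const size_t jend = min_size(jj + bs, n);
                for (size_t i = ii; i < iend; ++i)
                    for (size_t k = kk; k < kend; ++k) {
                        const double aik = a[i * n + k];
                        for (size_t j = jj; j < jend; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
        }
    }
}
```

Both functions accumulate the contributions to each element of C in the same order of k, so they produce identical results; only the memory access pattern differs. A suitable value of bs depends on the cache sizes of the target processor and would normally be found by measurement.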
The major optimization techniques used throughout the library include:
• Loop unrolling to minimize loop management costs
• Blocking of data to improve data reuse opportunities
• Copying to reduce chances of data eviction from cache
• Data prefetching to help hide memory latency
• Multiple simultaneous operations (for example, dot products in dgemm) to eliminate stalls due to arithmetic unit pipelines
• Use of hardware features such as the SIMD arithmetic units, where appropriate

These are techniques from which the arithmetic code benefits the most.

Parallelism

In addition to the performance enhancements discussed above, Intel® MKL offers performance gains through parallelism provided by the symmetric multiprocessing (SMP) feature. You can obtain improvements from SMP in the following ways:
• One way is based on user-managed threads in the program and further distribution of the operations over the threads based on data decomposition, domain decomposition, control decomposition, or some other parallelizing technique. Each thread can use any of the Intel MKL functions (except for the deprecated ?lacon LAPACK routine) because the library has been designed to be thread-safe.
• Another method is to use the FFT and BLAS level 3 routines. They have been parallelized and require no alterations of your application to gain the performance enhancements of multiprocessing. Performance using multiple processors on the level 3 BLAS shows excellent scaling. Since the threads are called and managed within the library, the application does not need to be recompiled thread-safe (see also Fortran 95 Interface Conventions in Chapter 2).
• Yet another method is to use tuned LAPACK routines. Currently these include the single- and double-precision flavors of routines for QR factorization of general matrices, triangular factorization of general and symmetric positive-definite matrices, solving systems of equations with such matrices, as well as solving symmetric eigenvalue problems.

For instructions on setting the number of available processors for the BLAS level 3 and LAPACK routines, see Intel® MKL User's Guide.

C Datatypes Specific to Intel MKL

The mkl_types.h file defines datatypes specific to Intel MKL.
C/C++ Type | Fortran Type | LP32 Equivalent (Size in Bytes) | LP64 Equivalent (Size in Bytes) | ILP64 Equivalent (Size in Bytes)
-----------|--------------|---------------------------------|---------------------------------|----------------------------------
MKL_INT (MKL integer) | INTEGER (default INTEGER) | C/C++: int; Fortran: INTEGER*4 (4 bytes) | C/C++: int; Fortran: INTEGER*4 (4 bytes) | C/C++: long long (or define the MKL_ILP64 macro); Fortran: INTEGER*8 (8 bytes)
MKL_UINT (MKL unsigned integer) | N/A | C/C++: unsigned int (4 bytes) | C/C++: unsigned int (4 bytes) | C/C++: unsigned long long (8 bytes)
MKL_LONG (MKL long integer) | N/A | C/C++: long (4 bytes) | C/C++: long (Windows: 4 bytes; Linux, Mac: 8 bytes) | C/C++: long (8 bytes)
MKL_Complex8 (like C99 complex float) | COMPLEX*8 | (8 bytes) | (8 bytes) | (8 bytes)
MKL_Complex16 (like C99 complex double) | COMPLEX*16 | (16 bytes) | (16 bytes) | (16 bytes)

You can redefine datatypes specific to Intel MKL. One reason to do this is if you have your own types that are binary-compatible with Intel MKL datatypes, with the same representation or memory layout. To redefine a datatype, use one of these methods:
• Insert the #define statement redefining the datatype before the mkl.h header file #include statement. For example,
    #define MKL_INT size_t
    #include "mkl.h"
• Use the compiler -D option to redefine the datatype. For example,
    ...-DMKL_INT=size_t...

NOTE As the user, if you redefine Intel MKL datatypes you are responsible for making sure that your definition is compatible with that of Intel MKL. If not, it might cause unpredictable results or crash the application.

Chapter 2. BLAS and Sparse BLAS Routines

This chapter describes the Intel® Math Kernel Library implementation of the BLAS and Sparse BLAS routines, and BLAS-like extensions.
The routine descriptions are arranged in several sections:
• BLAS Level 1 Routines (vector-vector operations)
• BLAS Level 2 Routines (matrix-vector operations)
• BLAS Level 3 Routines (matrix-matrix operations)
• Sparse BLAS Level 1 Routines (vector-vector operations)
• Sparse BLAS Level 2 and Level 3 Routines (matrix-vector and matrix-matrix operations)
• BLAS-like Extensions

Each section presents the routine and function group descriptions in alphabetical order by routine or function group name; for example, the ?asum group, the ?axpy group. The question mark in the group name corresponds to different character codes indicating the data type (s, d, c, and z or their combination); see Routine Naming Conventions.

When BLAS or Sparse BLAS routines encounter an error, they call the error reporting routine xerbla.

In BLAS Level 1 groups i?amax and i?amin, an "i" is placed before the data-type indicator and corresponds to the index of an element in the vector. These groups are placed at the end of the BLAS Level 1 section.

BLAS Routines

Routine Naming Conventions

BLAS routine names have the following structure:

<character> <name> <mod> ( )

The <character> field indicates the data type:
s    real, single precision
c    complex, single precision
d    real, double precision
z    complex, double precision

Some routines and functions can have combined character codes, such as sc or dz. For example, the function scasum uses a complex input array and returns a real value.

The <name> field, in BLAS level 1, indicates the operation type. For example, the BLAS level 1 routines ?dot, ?rot, ?swap compute a vector dot product, vector rotation, and vector swap, respectively.

In BLAS level 2 and 3, <name> reflects the matrix argument type:
ge    general matrix
gb    general band matrix
sy    symmetric matrix
sp    symmetric matrix (packed storage)
sb    symmetric band matrix
he    Hermitian matrix
hp    Hermitian matrix (packed storage)
hb    Hermitian band matrix
tr    triangular matrix
tp    triangular matrix (packed storage)
tb    triangular band matrix.
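To show what the combined character code sc and the leading "i" mean in practice, here is a small C sketch of the computations that scasum and isamax perform. These are plain reference loops written for illustration, not the Intel MKL entry points: the names ref_scasum and ref_isamax and the cfloat struct are assumptions of this example. Following the reference BLAS definition, the "magnitude" summed by scasum is |Re| + |Im| of each element, and the sketch returns a zero-based index where the Fortran routine returns a one-based one.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a single-precision complex number (not the MKL type). */
typedef struct { float re, im; } cfloat;

static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* scasum-style sum: complex single-precision input (the "c" in "sc"),
 * real single-precision result (the "s" in "sc"). Each element
 * contributes |Re(x_i)| + |Im(x_i)|, as in the reference BLAS. */
static float ref_scasum(size_t n, const cfloat *x)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += abs_f(x[i].re) + abs_f(x[i].im);
    return sum;
}

/* isamax-style search: the leading "i" means the result is an index,
 * here the zero-based position of the first element with the largest
 * absolute value in a real single-precision vector. */
static size_t ref_isamax(size_t n, const float *x)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (abs_f(x[i]) > abs_f(x[best]))
            best = i;
    return best;
}
```

For the vector {1-2i, -3+4i, 0.5} the sum is 3 + 7 + 0.5 = 10.5, and for the real vector {1, -7, 7, 2} the search returns position 1 because the first of the two largest magnitudes wins.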
The <mod> field, if present, provides additional details of the operation. BLAS level 1 names can have the following characters in the <mod> field:
c     conjugated vector
u     unconjugated vector
g     Givens rotation construction
m     modified Givens rotation
mg    modified Givens rotation construction

BLAS level 2 names can have the following characters in the <mod> field:
mv    matrix-vector product
sv    solving a system of linear equations with a single unknown vector
r     rank-1 update of a matrix
r2    rank-2 update of a matrix.

BLAS level 3 names can have the following characters in the <mod> field:
mm     matrix-matrix product
sm     solving a system of linear equations with multiple unknown vectors
rk     rank-k update of a matrix
r2k    rank-2k update of a matrix.

The examples below illustrate how to interpret BLAS routine names:
ddot : double-precision real vector-vector dot product
cdotc : complex vector-vector dot product, conjugated
scasum : sum of magnitudes of vector elements, single precision real output and single precision complex input
cdotu : vector-vector dot product, unconjugated, complex
sgemv : matrix-vector product, general matrix, single precision
ztrmm